Update database benchmarks to take trie cache into account #6131
Sorry, I did not see the ping, the code is here:
You can use it from the polkadot or polkadot-omni-node binary: |
Do you mostly need this for Polkadot+Kusama AssetHub or something else as well? |
Yes, it is needed only for AssetHub. But it would be good to also have those weights in polkadot-sdk so we can test on Westend AssetHub. I would assume we add some flag. This flag would be the same as …
Would be awesome if the runtime somehow could advise the client which of those flags to set. But I guess this is not in the cards. |
I would break this issue into several parts:
POC:
I'm not entirely sure I understand your point. If you mean setting …, I believe there's an alternative approach where the runtime determines the node's behavior, rather than the other way around. For example, the runtime specifies … Am I missing something here, or does this approach make sense? |
This is why I wrote that you are only supposed to set this value once you have made sure your nodes all start with said CLI switch. It is a convention, just like all the rest of the weights (you need to have a certain hardware spec).
Yes, this would be a good safeguard. But I wouldn't set it automatically. Rather, I would error out if the flag is not set, or emit a warning that needs to be overridden. It is still a heuristic. Regarding your plan: I would expect to see "changing the benchmarks to measure in-memory access" somewhere. We also need to find out whether the trie cache is write back or write through. If it is write through, the cache is probably too slow. Benchmarking will reveal that. |
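For readers unfamiliar with the distinction, here is a minimal sketch of the two write policies being discussed. It is illustrative only, not the sp-trie or RocksDB code; all names are made up.

```rust
use std::collections::HashMap;

// Stand-in for a slow RocksDB write.
fn persist_to_db(_key: &[u8], _value: &[u8]) {}

enum WritePolicy {
    WriteThrough, // every write hits the database synchronously
    WriteBack,    // writes land in the cache and are flushed later in a batch
}

struct Store {
    cache: HashMap<Vec<u8>, Vec<u8>>,
    dirty: Vec<(Vec<u8>, Vec<u8>)>,
    policy: WritePolicy,
}

impl Store {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.cache.insert(key.clone(), value.clone());
        match self.policy {
            // Write-through: the slow database write happens on every access.
            WritePolicy::WriteThrough => persist_to_db(&key, &value),
            // Write-back: the write stays in memory until the next flush.
            WritePolicy::WriteBack => self.dirty.push((key, value)),
        }
    }

    // With write-back, the database only sees one batched flush.
    fn flush(&mut self) {
        for (k, v) in self.dirty.drain(..) {
            persist_to_db(&k, &v);
        }
    }
}
```

If the trie cache turns out to be write-through, every runtime write would still pay the synchronous database cost, which is what the benchmarking is meant to reveal.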
4. …
5. Investigate whether the trie cache is write back or write through

OK, let's say we release a new polkadot-fellows runtime for AssetHubPolkadot/AssetHubKusama/Westend with benchmarked …
Could you please elaborate on this, specifically on "when you made sure your nodes all start with the said CLI switch" and "your nodes"? How would you achieve or ensure this for AssetHubPolkadot/AssetHubKusama? I mean, anyone can run a node/collator with a different node version and/or without the … |
The on-disk format shouldn't be relevant, as it will not hit RocksDB synchronously (assuming a write-back cache). Maybe …
Yeah it is a good idea to inspect how the trie cache scales with size.
How would you safeguard against somebody running their node on a slow computer? You can't. It is exactly the same thing here. We are essentially raising the hardware requirements. So we communicate this new requirement, and any safeguard we can add is nice, but not required. We then update the runtime after enough people have upgraded their node (but at least all collators).
Light clients and validator nodes are not affected because they are stateless. Collators and full nodes are affected. |
@athei I played a bit with
Notes/Hints:
no trie cache
default trie cache: 67108864 bytes (64 MiB)
1GB trie cache
|
This does not look very promising. Cached vs. disk access should show a much bigger difference. Probably those benchmarks are not measuring the right thing. Have you looked into those benchmarks that Basti wrote for the trie cache? |
Indeed, I was puzzled by the same thing in our Spammening runs. |
Were the collators running into the timeout or the weight limit? @serban300 is taking this over, thank you. Keep in mind that this is a very involved issue. The benchmarks we have probably measure the wrong thing. We would need to take into account the SSD cache, the OS cache and all those things. Maybe even come up with a macro benchmark instead of the micro benchmarks we are doing. It also requires understanding the whole stack and what exactly the trie cache is doing. Let me state the objective of that issue a bit more high level: we want to put the whole (AssetHub) state in RAM and have the weights reflect that, completely eradicating disk access from the equation. Having a state-sized trie cache and warming it up is the assumed best short-term solution for that. It could be that this assumption is wrong. |
@athei We are also investigating this. Let's not duplicate the effort :) We have this document to track the work being done and progress: https://hackmd.io/KKFGKcmXTom84ot32rSKmw?both |
I'm quite sure there is a bug. If you run my benchmarks, the trie cache variant is much slower than back in the day. It is actually slower than the version using the db :D Something broke, but I don't know yet what. |
So, I looked into it. I have identified multiple issues, but first let's start with the benchmark. (You can replicate them by using the …) With trie cache:
Without trie cache:
So, on average there is a 7x speed-up. I think back in the day I had a 9-10x difference between raw db reads and using the trie cache. The "benchmark storage" command is also not really written in the way normal block access works: it always gets a new state instance, while block import/building always works with the same instance. This probably also brings some slowdown. The issues I found are caused by the way the local and shared cache work. The local cache has some magic constants that determine its maximum size, and there is another set of magic constants that determines how many entries from the local cache can be written back to the shared cache. The shared cache also behaves a little bit weirdly when it comes to calculating its required size (or at least lacks something to check that everything adds up). I have written down the following tasks that need to be done:
|
Thank you @bkchr, I can also reproduce your ~7x speedup with your branch.
Additionally, we also made some investigations into why running with … The configurations we tested:
- No trie cache, default RocksDB cache 1 GiB
- Trie cache 1 GiB, default RocksDB cache 1 GiB
- No trie cache, RocksDB cache set to 128 MiB
- Trie cache 1 GiB, RocksDB cache 128 MiB
- Trie cache 10 GiB, default RocksDB cache 1 GiB
Takeaways
|
Always traversing the trie + decoding every node on every access costs around 2-3x in performance (if I remember correctly). Given that the cache is not working correctly, this is more likely the explanation for why you don't see any impact in the benchmark. |
Another thing that just came to my mind: when the cache is not big enough to hold the entire state, nodes that we touched earlier are more likely to be removed. Yeah, not that special with an LRU cache, but if we for example remove the lower layers of the trie, the performance impact could be bigger. Basically what I want to say is: the weighting of what should be removed from the LRU cache should be based on the depth of the node, because the lower the depth, the higher the likelihood that it will be used again by any other query. The storage root, for example, is always required when going down the trie. |
I wanted to write "should be done based on the depth of the node combined with the LRU", because at some point we want to remove those nodes that are not accessed that often.
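A minimal sketch of that idea (illustrative only, not the actual sp-trie eviction logic): score eviction candidates by recency, but give shallow nodes extra protection since they sit on almost every lookup path. The names and the weighting factor are assumptions.

```rust
struct CachedNode {
    last_access: u64, // logical tick of the last access
    depth: u8,        // distance from the trie root
}

// Higher score = evict first. Staleness dominates, but nodes closer to the
// root get a bonus because they are shared by almost every query.
fn eviction_score(node: &CachedNode, now: u64) -> u64 {
    let staleness = now.saturating_sub(node.last_access);
    let depth_protection = 64u64.saturating_sub(node.depth as u64) * 16;
    staleness.saturating_sub(depth_protection)
}
```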
@alexggh and I discussed this a little bit further in DM. He looked into the inline cache size. Each …
Alright, I took a closer look and I want to make sure we are going in the right direction here, so let me recap what I've understood: the end goal of this ticket, the current state of the caching, open questions from my side, potentially related issues, and what's next.

End goal

@athei stated it pretty clearly here: #6131 (comment) "Let me state the objective of that issue a bit more high level: We want to put the whole (AssetHub) state in RAM and have the weights reflect that. Completely eradicating disk access from the equation. Having a state sized trie cache and warming that up is the assumed best short term solution for that"

Current state of caching

As of now, besides the OS caching, the polkadot node has two types of caching: the RocksDB database cache, which works at the db level and is configured to 1 GiB by default, and the TrieCache (benefits explained in the link), which sits on top of it. The current TrieCache is implemented with two components, a local cache and a shared cache; the size of the shared cache is what --trie-cache-size configures.

Read path

Polkadot nodes always try to find the node in the local cache first, then in the shared cache, and only then fall back to the database. Nodes and values get promoted from the shared cache into the local cache when accessed (a minimal sketch of this flow follows after this comment).

Write path

The writes seem to be write-through: we write both to the main RocksDB database and to the cache. As for reads, nodes and values get promoted to the shared cache.

Related issues
During the Kusama spammening, some overweight relay-chain blocks were created: https://github.com/ordian/replay-kusama-25876038?tab=readme-ov-file#benchmark-results-on-the-reference-hardware. The trie cache does not seem to impact the execution time in any way: https://hackmd.io/@Ww6uNnIISmqGwXufqPYPOw/S1IIq31uyx. Although it might be related somehow to the cache performance, I think this should be treated as a separate issue.

Open questions
Next
|
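A minimal sketch of the read path recapped above, under the stated assumptions. Names and types are illustrative stand-ins, not the actual local/shared trie cache API.

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

struct Caches {
    local: HashMap<Key, Value>,  // short-lived, per-operation cache
    shared: HashMap<Key, Value>, // long-lived, size-bounded cache
    db: HashMap<Key, Value>,     // stand-in for RocksDB
}

fn read(caches: &mut Caches, key: &[u8]) -> Option<Value> {
    if let Some(v) = caches.local.get(key) {
        return Some(v.clone()); // cheapest path: local hit
    }
    if let Some(v) = caches.shared.get(key).cloned() {
        // promote into the local cache for subsequent accesses
        caches.local.insert(key.to_vec(), v.clone());
        return Some(v);
    }
    // miss in both caches: hit the database and cache the result
    let v = caches.db.get(key).cloned()?;
    caches.local.insert(key.to_vec(), v.clone());
    Some(v)
}
```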
Would be interesting to see the cache hits with |
I've spent some time investigating this. Looking into the write benchmark, we can see that it is split into two parts: computing the storage root and committing the changes to RocksDB.
Running the write benchmarks (with asset-hub-westend state) gives me the following results on my machine: … Computing the storage root could massively benefit from using a warmed TrieCache, so I decided to enable it here and re-ran the same benchmark; the results are: …

My takeaway

From this data, I don't think we would gain other significant improvements by trying to remove the writes to … |
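For context, a hedged sketch of how the two phases could be timed separately. The trait is a hypothetical stand-in; the real benchmark would use the trie backend's overlay and the RocksDB transaction instead.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the pending block changes.
trait PendingChanges {
    fn compute_storage_root(&self) -> [u8; 32];
    fn commit_to_db(&self);
}

/// Returns (storage-root nanoseconds, commit nanoseconds).
fn time_write_phases(changes: &impl PendingChanges) -> (u128, u128) {
    let t0 = Instant::now();
    let _root = changes.compute_storage_root(); // phase 1: recompute the trie root
    let root_ns = t0.elapsed().as_nanos();

    let t1 = Instant::now();
    changes.commit_to_db(); // phase 2: flush the changes to the database
    let commit_ns = t1.elapsed().as_nanos();

    (root_ns, commit_ns)
}
```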
Okay, let me phrase the objective even more high level: the costs associated with reads and writes (as expressed by the runtime weights) are way too high. With everything in RAM they should not be in the double-digit microseconds. This is why I assumed disk access is the issue. If something else is the issue, then that needs to be fixed.
Costs are still too high. It seems that rebuilding the trie root on every write is the issue here, not writing to RocksDB. So this has to be addressed. The write to RocksDB (while the smaller portion) still takes double-digit microseconds. A write-back solution would be preferable; we have the RAM. I don't see numbers on how the trie cache improved read performance, only write performance. What was the effect there?
I guess we don't know. Weights could be underestimated in any place; it doesn't have to be related to storage access. Since the cache size didn't affect how long the blocks took to import, the issue might be somewhere else.
The use case is contract execution. They have to access 5 different storage items to even start executing. That is also true for every sub-call they do. Because the ref time overhead per read is so high, we are getting bottlenecked by this. Contracts also access a lot of small storage items due to the fact that they can only store 256 bits (one EVM word) per item.
It doesn't meaningfully improve performance. Read accesses are cached, and write accesses seem to be bottlenecked by CPU; the slowness of RocksDB writes can hopefully be fixed by some config changes (write back).
Yes. We aim for that cutoff so that we can have one Kusama release with the new weights before deploying to Polkadot. |
On my machine, which is 1.5x faster than the reference machine, built with … |
The interesting number would be with/without trie cache on the same machine. Getting the read costs down a lot would already be a big win. The writes seem to be complicated for the reasons you point out. But reads should get sped up by much more than 1.5x if they were hitting the disk before. |
Sorry, I misled you with the … Regarding runs with the trie cache and without, the results look like this: Without trie cache:
With trie cache, with hacks/fixes from here
|
Sorry, I probably didn't look into each post properly. That's a 30x speedup?! Units are nanoseconds? If we can get that reflected in the read weights, that would be great. So rolling this out in two steps seems like a good idea: the quick read win first and then the writes later. |
Yes and yes. |
Looking a bit closer at this write benchmark, I don't think it measures the cost of writes correctly, because as far as I can tell, when we import or create a block, the storage root computation and the application of changes to RocksDB happen at the end of the block.
|
This is good news. Recalculating the trie root and flushing to RocksDB on every write seemed nonsensical to me. This fits much better with my mental model now. The speed-up with the trie cache seems to be much lower than for reads. I guess this makes sense, since writes never had to hit the disk for every access, even without the trie cache. But even with batching they still cost much more than reads (12us vs 800ns). This all seems to click into place now. Still some open questions:
But my takeaway is this: our weights are vastly overestimating the actual costs of 20us for reads and 100us for writes. The former because we benchmarked without a warmed-up cache; the latter because the benchmark did not take into account that much of it is a one-time cost. |
Yes, the more keys you change the more expensive the root calculation and committing are.
It is roughly 60% spent on computing the storage root and 40% committing the changes.
Yeah, that's what worries me as well, although there are several other reasons that might explain it, like #6131 (comment), or maybe we wrongly computed the weights for the involved extrinsics. Also, these basic operation costs highly depend on the actual state you are using (see …), so maybe the Kusama RC state looks different. That's one of the reasons I was asking for a high-level use case/test that we can use to make sure these micro-benchmarks actually correlate when you put them on our full stack with the smart contracts use case. |
I suggest then to conduct the overweight block test/benchmark with a block filled with transfers. Conduct this check with a warmed-up trie cache. Then check whether the blocks are taking too long with the current weights (being overweight). If they are already overweight, then we have a problem. But I suspect the blocks will be underweight, because there is a lot of storage access but not a lot of computation in these transfers, and we still charge accesses very highly. It is important that this uses the same contract and not different ones, because otherwise you will probably hit the PoV limit first due to the amount of code that has to be included in the PoV. |
Yea this is reasonable. Maybe add a CLI flag to configure this. Not sure what a good default would be, maybe 100 or 1000? |
I've been asking myself this as well, and I think it is higher, 10_000 or even 100_000, because 10_000 eats in total just 145ms and 100_000 around 1.25 secs. These weights are important when you have a lot of writes in the block; if you have just a few, it doesn't matter much because you will always be within the time bounds. |
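As a sanity check on those totals (assuming they are pure write time), the implied marginal cost works out to roughly:

```latex
\frac{145\,\text{ms}}{10\,000} \approx 14.5\,\mu\text{s per write},
\qquad
\frac{1.25\,\text{s}}{100\,000} = 12.5\,\mu\text{s per write}
```

which is in the same ballpark as the ~12us per write quoted earlier in the thread.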
The real weight is what you would get when the batch size is infinity, so basically when … But there are practical problems with that. The simple idea is to just use a big enough … But I think just splitting this into two benchmarks would make it simpler.
Then you would get better results even for small …
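A hedged sketch of what the two-benchmark split buys you, assuming the linear model t(n) = fixed + n * per_key discussed above (function and variable names are illustrative):

```rust
/// Estimate the one-time overhead (storage root + commit) and the marginal
/// per-key cost from two measured batch sizes n1 < n2.
fn split_costs(n1: u64, t1_ns: f64, n2: u64, t2_ns: f64) -> (f64, f64) {
    let per_key = (t2_ns - t1_ns) / (n2 - n1) as f64;
    let fixed = t1_ns - n1 as f64 * per_key;
    (fixed, per_key)
}
```

Differencing two batch sizes cancels the one-time cost, so the per-key estimate stays accurate even without an enormous batch.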
I think it became a bit hard to follow all the threads in this issue, so I decided to summarise and provide a breakdown here: https://hackmd.io/@o1WwXXhpRRmk87EoSBRzqA/Hk6D4NfFJg. Let me know if you think anything is missing or if something is not clear. Once all is good, I'll move them here as GitHub sub-issues. |
As of right now the weight for database access is hard coded to 25us and 100us for read and write respectively. Those numbers are way too high given that we have a trie cache now.
We should write a benchmark that measures storage access with a warm cache, meaning that every access in the benchmark hits the cache. This is okay to assume, as we will require the cache to be larger than the state size.
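A minimal sketch of that warm-cache benchmark, assuming a hypothetical StateBackend view over the trie state (not an existing API); the real benchmark would go through the node's trie backend with the trie cache enabled.

```rust
use std::time::Instant;

/// Hypothetical minimal view over the state.
trait StateBackend {
    fn keys(&self) -> Vec<Vec<u8>>;
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Warm the cache with one full pass, then time a second pass so that every
/// measured access is a cache hit. Returns the average nanoseconds per read.
fn bench_warm_reads(state: &impl StateBackend) -> f64 {
    let keys = state.keys();
    for k in &keys {
        let _ = state.get(k); // warm-up pass: populate the trie cache
    }
    let start = Instant::now();
    for k in &keys {
        let _ = state.get(k); // measured pass: should hit the cache every time
    }
    start.elapsed().as_nanos() as f64 / keys.len().max(1) as f64
}
```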
The results of that benchmark can be used for chains where we require collators to set a large --trie-cache-size. For example, AssetHub.

@ggwpez Do we have a storage access benchmark already in the codebase somewhere?
We also need to add code that reads the whole database on node startup. This makes sure that all the state is in RAM.
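A hedged sketch of that startup warm-up, reusing the same hypothetical StateBackend shape as in the benchmark sketch above (the real node would use its existing trie backend instead):

```rust
/// Hypothetical minimal view over the state, as in the benchmark sketch.
trait StateBackend {
    fn keys(&self) -> Vec<Vec<u8>>;
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Walk the whole state once at startup so that every trie node and value ends
/// up in the (state-sized) trie cache before block import or authoring starts.
fn warm_up_state_cache(state: &impl StateBackend) -> usize {
    let mut touched = 0;
    for key in state.keys() {
        // Reading the value pulls the full root-to-leaf path into the cache.
        let _ = state.get(&key);
        touched += 1;
    }
    touched
}
```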