Update database benchmarks to take trie cache into account #6131

athei opened this issue Oct 18, 2024 · 36 comments

athei (Member) commented Oct 18, 2024

As of right now the weight for database access is hard coded to 25us and 100us for read and write respectively. Those numbers are way too high given that we have a trie cache now.

We should write a benchmark that measures storage access with a warm cache, meaning that every access in the benchmark hits the cache. This is okay to assume because we will require the cache to be larger than the state size.

The results of that benchmark can be used for chains where we require collators to set a large --trie-cache-size. For example, AssetHub.

@ggwpez Do we have a storage access benchmark already in the codebase somewhere?

We also need to add code that reads the whole database on node startup. This makes sure that all the state is in RAM.
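
A minimal sketch of such a warm-up pass (the trait and method names are made up for illustration, not an existing polkadot-sdk API):

```rust
/// Illustrative-only trait standing in for whatever backend handle the node has
/// at startup; this is not the real polkadot-sdk API.
trait StateBackendLike {
    fn keys(&self) -> Vec<Vec<u8>>;
    fn storage(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Read every key once so the trie cache is hot before block authoring starts.
fn warm_up_state<B: StateBackendLike>(state: &B) {
    let mut warmed = 0usize;
    for key in state.keys() {
        // Going through the cached backend pulls the trie nodes and the value
        // into the trie cache as a side effect of the read.
        let _ = state.storage(&key);
        warmed += 1;
    }
    println!("Warmed up trie cache with {warmed} keys");
}
```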

ggwpez (Member) commented Oct 21, 2024

Sorry, I did not see the ping. The code is here:

info!("Reading {} keys", keys.len());
for key in keys.as_slice() {

You can use it from the polkadot or polkadot-omni-node binary:
polkadot benchmark storage --state-version 1 -d ~/.local/share/polkadot, or the same with polkadot-omni-node, but then it also needs a --chain spec.

athei moved this to Minimal Feature Launch in Smart Contracts Dec 4, 2024
ggwpez (Member) commented Dec 9, 2024

Do you mostly need this for Polkadot+Kusama AssetHub or something else as well?
We would need to do it directly in the Runtimes repo.

athei (Member, Author) commented Dec 9, 2024

Yes, it is only needed for AssetHub. But it would be good to also have those weights in polkadot-sdk so we can test on Westend AssetHub. I would assume we add some InMemoryDBWeights that you are only supposed to use when you launch all your nodes with --in-ram-db=xxGB.

This flag would be the same as --trie-cache-size but does this in addition:

  1. Read the whole state on node startup so that it is in cache.
  2. Error out if the state doesn't fit in RAM plus some safety margin.

It would be awesome if the runtime could somehow advise the client which of those flags to set, but I guess this is not in the cards.
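
A minimal sketch of the check in step 2 above (the flag name and the 10% margin are assumptions, not existing options):

```rust
/// Illustrative check for the proposed `--in-ram-db` behaviour: refuse to start
/// if the state (plus a safety margin) does not fit into the configured budget.
fn check_state_fits_in_ram(state_size_bytes: u64, in_ram_db_bytes: u64) -> Result<(), String> {
    // Leave some headroom for state growth and cache bookkeeping overhead.
    let required = state_size_bytes + state_size_bytes / 10;
    if required > in_ram_db_bytes {
        return Err(format!(
            "state ({state_size_bytes} bytes + margin) does not fit into --in-ram-db ({in_ram_db_bytes} bytes)"
        ));
    }
    Ok(())
}
```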

athei assigned ggwpez and bkontur and unassigned ggwpez Dec 11, 2024
bkontur (Contributor) commented Dec 17, 2024

I would break this issue into several parts:

  1. Set up a process for regularly regenerating/maintaining (RocksDB) hard-coded storage read/write weights

    As you pointed out, we are using this hard-coded, copied weight everywhere. Some runtimes even directly include these values from frame_support, for example, here.

    Essentially, we are consistently using read: 25_000 * constants::WEIGHT_REF_TIME_PER_NANOS and write: 100_000 * constants::WEIGHT_REF_TIME_PER_NANOS without measuring them, as stated here:

    The cost of storage operations in a Substrate chain depends on the current chain state. It is therefore important to regularly update these weights as the chain grows.

  • As a first step, I would set up and fix runtimes (e.g., polkadot-sdk, polkadot-fellows) and regenerate the rocksdb_weights.rs file as a starting point.
  • Additionally, I would establish a process: whenever we regenerate weights for benchmarks, we should also benchmark storage using a current snapshot of the relevant chain state.
    • for live chains, this probably needs to download a snapshot or just the state (similar to what try-runtime does: download the storage keys and then run benchmark storage)
  • I will also create a tracking issue for polkadot-fellows to ensure runtimes are fixed.
  • I would advise parachain teams to do the same for their runtimes

POC:
I am playing here with AssetHubWestend and benchmark storage, but it reads just the genesis state with 12 keys (investigating). I have storage with a partially synced chain with finalized headers - still investigating...

./target/production/polkadot-parachain benchmark storage --state-version 1 --include-child-trees --chain cumulus/polkadot-parachain/chain-specs/asset-hub-westend.json --base-path /home/ubuntu/.local/share/polkadot-parachain/chains/asset-hub-westend/db/full
2024-12-17 15:47:19  creating SingleState txpool Limit { count: 8192, total_bytes: 20971520 }/Limit { count: 512, total_bytes: 1048576 }.    
2024-12-17 15:47:19 Warmup round 1/1    
2024-12-17 15:47:19 Preparing keys from block 0x67f9…98c9    
2024-12-17 15:47:19 Reading 12 keys    
2024-12-17 15:47:19 Reading 0 child keys

  2. Add storage benchmarking support to frame-omni-bencher
  • This also means updating benchmark bots to support storage benchmarking.
    This support may be necessary mainly for bot operations to avoid needing to build the entire polkadot or polkadot-parachain binary.

  3. Node (polkadot/polkadot-parachain/polkadot-omni-node) Integration

I would assume we add some InMemoryDBWeights that you are only supposed to use when you launch all your nodes with --in-ram-db=xxGB.

I’m not entirely sure I understand your point. If you mean setting InMemoryDBWeights instead of RocksDBWeights in the runtime (e.g., here), that approach won’t work because the node cannot control the runtime. In other words, a node’s --in-ram-db flag cannot toggle the runtime’s switch between InMemoryDBWeights and RocksDBWeights.

I believe there’s an alternative approach where the runtime determines the node’s behavior, rather than the other way around. For example, the runtime specifies type DbWeight = RocksDbWeight;, and we can read this configuration in the node. The node’s startup parameters, such as --trie-cache-size and/or --in-ram-db, could then be adjusted dynamically based on the runtime’s DbWeight values.
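
A rough sketch of the shape such a safeguard could take on the node side (how the first input is obtained, e.g. from runtime metadata, is left open; everything here is hypothetical):

```rust
/// Sketch of the suggested safeguard: compare what the runtime declares with
/// what the node was started with, and refuse to continue on a mismatch.
/// Nothing like this exists today.
fn validate_cache_flags(
    runtime_expects_in_memory_state: bool,
    trie_cache_size: Option<u64>,
    min_required_cache: u64,
) -> Result<(), String> {
    if runtime_expects_in_memory_state
        && trie_cache_size.map_or(true, |size| size < min_required_cache)
    {
        return Err(
            "runtime weights assume an in-memory state; restart with a sufficiently large --trie-cache-size"
                .into(),
        );
    }
    Ok(())
}
```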

Am I missing something here, or does this approach make sense?

athei (Member, Author) commented Dec 17, 2024

I’m not entirely sure I understand your point. If you mean setting InMemoryDBWeights instead of RocksDBWeights in the runtime (e.g., here), that approach won’t work because the node cannot control the runtime. In other words, a node’s --in-ram-db flag cannot toggle the runtime’s switch between InMemoryDBWeights and RocksDBWeights.

This is why I wrote you are only supposed to set this value when you made sure your nodes all start with said cli switch. It is a convention. Just like all the rest of the weights (you need to have a certain hardware spec).

I believe there’s an alternative approach where the runtime determines the node’s behavior, rather than the other way around. For example, the runtime specifies type DbWeight = RocksDbWeight;, and we can read this configuration in the node. The node’s startup parameters, such as --trie-cache-size and/or --in-ram-db, could then be adjusted dynamically based on the runtime’s DbWeight values.

Yes, this would be a good safeguard. But I wouldn't set it automatically. Rather, I would error out if the flag is not set, or emit a warning that needs to be explicitly overridden. It is still a heuristic.

Regarding your plan: I would expect to see "changing the benchmarks to measure in-memory access" somewhere. We also need to find out if the trie cache is write-back or write-through. If it is write-through, the cache is probably too slow for writes. Benchmarking will reveal that.

bkontur (Contributor) commented Dec 18, 2024

4. InMemoryDBWeights - changing the benchmarks to measure in-memory access

  • run benchmark storage for AssetHubs (at first for Westend) to generate InMemoryDBWeights with --trie-cache-size and make sure all keys are warmed up in the cache (TrieCache/SharedTrieCache + print some stats to see TrieHitStats).
    • iiuc InMemoryDBWeights should be more like InMemoryTrieWithRocksDBWeights, right?
    • force benchmark storage to use some snapshot or synced db and trigger this on some high/recent block, because by default genesis has just 12 keys
  • set type DbWeight = InMemoryTrieWithRocksDBWeights for AssetHubWestend runtime
  • now, regenerate all the weights for AssetHubWestend

5. Investigate whether the trie cache is write-back or write-through
We also need to find out if the trie cache is write-back or write-through. If it is write-through, the cache is probably too slow for writes. Benchmarking will reveal that.


Ok, let's say we release a new polkadot-fellows runtime (AssetHubPolkadot/AssetHubKusama/Westend) with benchmarked InMemoryTrieWithRocksDBWeights (benchmark storage + pallet benchmarks with a very high --trie-cache-size).

This is why I wrote you are only supposed to set this value when you made sure your nodes all start with said cli switch.

Could you please elaborate on this, specifically on "when you made sure your nodes all start with the said CLI switch" and "your nodes"? How would you achieve or ensure this for AssetHubPolkadot/AssetHubKusama? I mean, anyone can run a node/collator with a different node version and/or without the trie-cache-size option. Ok, we can add a good safeguard with a heuristic on node startup, but what about nodes which do not have this version? Or I don't know, does this affect e.g. light-clients or just collators which build blocks?

athei (Member, Author) commented Dec 18, 2024

iiuc InMemoryDBWeights should be more like InMemoryTrieWithRocksDBWeights, right?

The on-disk format shouldn't be relevant, as it will not hit RocksDB synchronously (assuming a write-back cache). Maybe TrieCachedWeights is a better name.

force benchmark storage to use some snapshot or synced db and trigger this on some high/recent block, because by default genesis has just 12 keys

Yeah it is a good idea to inspect how the trie cache scales with size.

Could you please elaborate on this, specifically on "when you made sure your nodes all start with the said CLI switch" and "your nodes"? How would you achieve or ensure this for AssetHubPolkadot/AssetHubKusama? I mean, anyone can run a node/collator with a different node version and/or without the trie-cache-size option. Ok, we can add a good safeguard with a heuristic on node startup, but what about nodes which do not have this version?

How would you safeguard against somebody running their node on a slow computer? You can't. It is exactly the same thing here. We are essentially raising the hardware requirements. So we communicate this new requirement; any safeguard we can add is nice, but not required. We then update the runtime after enough people have upgraded their node (but at least all collators).

Or I don't know, does this affect e.g. light-clients or just collators which build blocks?

Light clients and validator nodes are not affected because they are stateless. Collators and full nodes are affected.

bkontur (Contributor) commented Jan 9, 2025

@athei I played a bit with benchmark storage for AssetHubWestend on a high block 10,384,607, which has 975,060 keys and 1,583 child keys, using different setups:

| Setup of trie-cache | READ | WRITE |
| --- | --- | --- |
| no trie cache | 10_816 * constants::WEIGHT_REF_TIME_PER_NANOS | 40_633 * constants::WEIGHT_REF_TIME_PER_NANOS |
| --enable-trie-cache, default trie cache (67108864 Bytes) | 13_343 * constants::WEIGHT_REF_TIME_PER_NANOS | 40_843 * constants::WEIGHT_REF_TIME_PER_NANOS |
| --enable-trie-cache --trie-cache-size 1073741824 (1GB trie cache) | 8_887 * constants::WEIGHT_REF_TIME_PER_NANOS | 40_787 * constants::WEIGHT_REF_TIME_PER_NANOS |
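
For reference, these numbers would plug into the runtime the same way the generated rocksdb_weights.rs does; a hypothetical constant based on the 1GB trie-cache row above might look like this (the name and values are placeholders, not generated output):

```rust
use frame_support::{parameter_types, weights::{constants, RuntimeDbWeight}};

parameter_types! {
    /// Placeholder constant filled with the 1GB trie-cache measurements above;
    /// real values would come out of `benchmark storage` on reference hardware.
    pub const InMemoryTrieWithRocksDBWeights: RuntimeDbWeight = RuntimeDbWeight {
        read: 8_887 * constants::WEIGHT_REF_TIME_PER_NANOS,
        write: 40_787 * constants::WEIGHT_REF_TIME_PER_NANOS,
    };
}
```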

Notes/Hints:

  • From the numbers, it seems that only read operations are affected by the trie-cache (likely as expected).
  • Basti changed the default trie-cache size to 1 GiB for nodes here: paritytech/polkadot-sdk#6546.
  • For benchmarking, the trie-cache is disabled by default. You can enable it with:
    --enable-trie-cache (default size is 64 MiB)
    --trie-cache-size 1073741824 (1 GiB)
    
  • I tried to add some cache stats to the logs. I can see state_cache: MemorySize(621345584) but no other metrics, e.g., state_reads_cache: 0 and state_writes_cache: 0:
    2025-01-07 20:34:45.140  INFO main frame_benchmarking_cli::storage::cmd: Cache usage:
    Some(UsageInfo { memory: MemoryInfo { state_cache: MemorySize(622322987), database_cache: MemorySize(0) }, io: IoInfo { transactions: 0, bytes_read: 0, bytes_written: 0, writes: 0, reads: 0, average_transaction_size: 0, state_reads: 976237, state_reads_cache: 0, state_writes: 0, state_writes_cache: 0, state_writes_nodes: 0 } })
    
    • Is there a bug here with state_reads_cache or state_writes_cache?
    • Should we also consider database_cache in this context?
  • I discovered that for my point 1., there is an existing issue: Run storage benchmarks before release polkadot-fellows/runtimes#491.

no trie cache

bkontur@toaster1:~/polkadot-sdk$ ./target/production/polkadot-parachain benchmark storage --state-version 1     --base-path /home/bkontur/.local/share/polkadot-parachain     --chain cumulus/polkadot-parachain/chain-specs/asset-hub-westend.json     --include-child-trees     --detailed-log-output

2025-01-06 23:29:28.475  INFO main txpool:  creating SingleState txpool Limit { count: 8192, total_bytes: 20971520 }/Limit { count: 512, total_bytes: 1048576 }.
2025-01-06 23:29:30.319  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:29:41.026  INFO main frame_benchmarking_cli::storage::read: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:29:42.565  INFO main frame_benchmarking_cli::storage::read: Reading 975060 keys
2025-01-06 23:29:53.298  INFO main frame_benchmarking_cli::storage::read: Reading 1583 child keys
2025-01-06 23:29:53.404  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 10559479290
Min: 972, Max: 636946
Average: 10816, Median: 11007, Stddev: 1755.06
Percentiles 99th, 95th, 75th: 14101, 13110, 11968
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:29:54.950  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:30:05.582  INFO main frame_benchmarking_cli::storage::write: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:30:07.629  INFO main frame_benchmarking_cli::storage::write: Writing 975060 keys
2025-01-06 23:31:49.859  INFO main frame_benchmarking_cli::storage::write: Writing 1583 child keys
2025-01-06 23:31:50.005  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 39667485804
Min: 4597, Max: 10498130
Average: 40633, Median: 40101, Stddev: 32806.69
Percentiles 99th, 95th, 75th: 51758, 48543, 44127
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:31:50.005  INFO main frame_benchmarking_cli::storage::cmd: Cache usage:
Some(UsageInfo { memory: MemoryInfo { state_cache: MemorySize(0), database_cache: MemorySize(0) }, io: IoInfo { transactions: 0, bytes_read: 0, bytes_written: 0, writes: 0, reads: 0, average_transaction_size: 0, state_reads: 1583, state_reads_cache: 0, state_writes: 0, state_writes_cache: 0, state_writes_nodes: 0 } })
2025-01-06 23:31:50.005  INFO main frame_benchmarking_cli::storage::template: Writing weights to "/home/bkontur/polkadot-sdk/rocksdb_weights.rs"

default trie cache: (67108864 Bytes)

bkontur@toaster1:~/polkadot-sdk$ ./target/production/polkadot-parachain benchmark storage --state-version 1 \
    --base-path /home/bkontur/.local/share/polkadot-parachain \
    --chain cumulus/polkadot-parachain/chain-specs/asset-hub-westend.json \
    --include-child-trees \
    --detailed-log-output \
    --enable-trie-cache
2025-01-06 23:35:00.615  INFO main txpool:  creating SingleState txpool Limit { count: 8192, total_bytes: 20971520 }/Limit { count: 512, total_bytes: 1048576 }.
2025-01-06 23:35:02.521  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:35:15.809  INFO main frame_benchmarking_cli::storage::read: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:35:17.403  INFO main frame_benchmarking_cli::storage::read: Reading 975060 keys
2025-01-06 23:35:30.643  INFO main frame_benchmarking_cli::storage::read: Reading 1583 child keys
2025-01-06 23:35:30.745  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 13026510157
Min: 601, Max: 1425945
Average: 13343, Median: 12960, Stddev: 7077.46
Percentiles 99th, 95th, 75th: 20401, 17377, 14472
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:35:32.334  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:35:45.486  INFO main frame_benchmarking_cli::storage::write: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:35:47.592  INFO main frame_benchmarking_cli::storage::write: Writing 975060 keys
2025-01-06 23:37:30.209  INFO main frame_benchmarking_cli::storage::write: Writing 1583 child keys
2025-01-06 23:37:30.343  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 39873256281
Min: 4457, Max: 10471867
Average: 40843, Median: 40341, Stddev: 35949.63
Percentiles 99th, 95th, 75th: 51789, 48544, 44167
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:37:30.343  INFO main frame_benchmarking_cli::storage::cmd: Cache usage:
Some(UsageInfo { memory: MemoryInfo { state_cache: MemorySize(39137407), database_cache: MemorySize(0) }, io: IoInfo { transactions: 0, bytes_read: 0, bytes_written: 0, writes: 0, reads: 0, average_transaction_size: 0, state_reads: 1583, state_reads_cache: 0, state_writes: 0, state_writes_cache: 0, state_writes_nodes: 0 } })
2025-01-06 23:37:30.343  INFO main frame_benchmarking_cli::storage::template: Writing weights to "/home/bkontur/polkadot-sdk/rocksdb_weights.rs"

1GB trie cache

bkontur@toaster1:~/polkadot-sdk$ ./target/production/polkadot-parachain benchmark storage --state-version 1 \
    --base-path /home/bkontur/.local/share/polkadot-parachain \
    --chain cumulus/polkadot-parachain/chain-specs/asset-hub-westend.json \
    --include-child-trees \
    --detailed-log-output \
    --trie-cache-size 1073741824 \
    --enable-trie-cache
2025-01-06 23:41:40.112  INFO main txpool:  creating SingleState txpool Limit { count: 8192, total_bytes: 20971520 }/Limit { count: 512, total_bytes: 1048576 }.
2025-01-06 23:41:41.976  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:41:53.921  INFO main frame_benchmarking_cli::storage::read: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:41:55.453  INFO main frame_benchmarking_cli::storage::read: Reading 975060 keys
2025-01-06 23:42:04.319  INFO main frame_benchmarking_cli::storage::read: Reading 1583 child keys
2025-01-06 23:42:04.418  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 8676248317
Min: 591, Max: 16262784
Average: 8887, Median: 10696, Stddev: 23789.4
Percentiles 99th, 95th, 75th: 18438, 15183, 12158
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:42:05.959  INFO main frame_benchmarking_cli::storage::cmd: Warmup round 1/1
2025-01-06 23:42:15.071  INFO main frame_benchmarking_cli::storage::write: Preparing keys from block 0x6b8a1b649c7876fc0fb4c1ef1a51f07ea97bf3322ee826977b5c23404e61a574
2025-01-06 23:42:17.133  INFO main frame_benchmarking_cli::storage::write: Writing 975060 keys
2025-01-06 23:43:59.560  INFO main frame_benchmarking_cli::storage::write: Writing 1583 child keys
2025-01-06 23:43:59.694  INFO main frame_benchmarking_cli::storage::cmd: Time summary [ns]:
Total: 39818611219
Min: 4437, Max: 11232225
Average: 40787, Median: 40181, Stddev: 34484.34
Percentiles 99th, 95th, 75th: 52259, 48935, 44337
Value size summary:
Total: 58047152
Min: 0, Max: 1693102
Average: 59, Median: 64, Stddev: 1899.09
Percentiles 99th, 95th, 75th: 80, 80, 80
2025-01-06 23:43:59.694  INFO main frame_benchmarking_cli::storage::cmd: Cache usage:
Some(UsageInfo { memory: MemoryInfo { state_cache: MemorySize(621345584), database_cache: MemorySize(0) }, io: IoInfo { transactions: 0, bytes_read: 0, bytes_written: 0, writes: 0, reads: 0, average_transaction_size: 0, state_reads: 1583, state_reads_cache: 0, state_writes: 0, state_writes_cache: 0, state_writes_nodes: 0 } })
2025-01-06 23:43:59.694  INFO main frame_benchmarking_cli::storage::template: Writing weights to "/home/bkontur/polkadot-sdk/rocksdb_weights.rs"

athei (Member, Author) commented Jan 10, 2025

This does not look very promising. A cached vs disk access should be a much bigger difference. Probably those benchmarks are not measuring the right thing. Have you looked into those benchmarks that Basti wrote for the trie cache?

eskimor (Member) commented Jan 22, 2025

A cached vs disk access should be a much bigger difference.

Indeed, I was puzzled by the same thing in our Spammening runs.

athei moved this from Minimal Feature Launch to Polkadot APIs in Smart Contracts Jan 24, 2025
serban300 assigned serban300 and unassigned bkontur Jan 24, 2025
athei (Member, Author) commented Jan 24, 2025

A cached vs disk access should be a much bigger difference.

Indeed, I was puzzled by the same thing in our Spammening runs.

Were the collators running into timeout or weight limit?

@serban300 is taking this over. Thank you. Keep in mind that this is a very involved issue. The benchmarks we have probably measure the wrong thing. We would need to take into account the SSD cache, the OS cache and all those things, and maybe even come up with macro benchmarks instead of the micro benchmarks we are doing. It also requires understanding the whole stack and what exactly the trie cache is doing.

Let me state the objective of that issue a bit more high level: We want to put the whole (AssetHub) state in RAM and have the weights reflect that. Completely eradicating disk access from the equation. Having a state-sized trie cache and warming it up is the assumed best short-term solution for that. It could be that this assumption is wrong.

sandreim (Contributor) commented Jan 24, 2025

@athei We are also investigating this. Let's not duplicate the effort :)

We have this document to track the work being done and progress: https://hackmd.io/KKFGKcmXTom84ot32rSKmw?both

sandreim assigned alexggh and unassigned serban300 Jan 24, 2025
bkchr (Member) commented Jan 24, 2025

It also requires understanding the whole stack and what exactly the trie cache is doing.

I'm quite sure there is a bug. If you run my benchmarks, the trie cache variant is much slower than it used to be. It is actually slower than the version using the db :D Something broke, but I don't know what yet.

bkchr (Member) commented Jan 26, 2025

So, I looked into it. I have identified multiple issues, but first let's start with the benchmark. (You can replicate them by using the bkchr-cache-hack branch)

With trie cache:

Total: 5284491790
Min: 580, Max: 802864
Average: 5408, Median: 1090, Stddev: 5271.43
Percentiles 99th, 95th, 75th: 14888, 11718, 9889
Value size summary:
Total: 59143159
Min: 0, Max: 1722470
Average: 60, Median: 64, Stddev: 1955.09
Percentiles 99th, 95th, 75th: 80, 80, 80

Without trie cache:

Total: 34594303769
Min: 5900, Max: 3538831
Average: 35404, Median: 35089, Stddev: 9944.67
Percentiles 99th, 95th, 75th: 49478, 44888, 38728
Value size summary:
Total: 59143159
Min: 0, Max: 1722470
Average: 60, Median: 64, Stddev: 1955.09
Percentiles 99th, 95th, 75th: 80, 80, 80

So, on average there is a 7x speed-up. I think back in the day I had a 9-10x difference between raw db reads and using the trie cache. The "benchmark storage" command is also not really written in the way normal block access works: it always gets a new state instance, while block import/building always works with the same instance. This probably also brings some slowdown.

The issues that I found are because of the way the local and shared caches work. The local cache has some magic constants that determine its maximum size, and there is another set of magic constants that determine how many entries from the local cache can be written back to the shared cache. The shared cache also behaves a little bit weirdly when it comes to the calculation of the required size (or at least we need something to check that everything adds up).

I have written down the following tasks that need to be done:

  • Find out why it wants to allocate over 3GB for the inline_size of the shared cache in the state_access benchmark. Is the number correct? Does everything with the LRU limiters work correctly?
  • Introduce some sort of "trusted cache". When importing/building blocks it should be fine to have a bigger local cache and to allow writing back more to the global cache.
  • Introduce a way to fetch the entire state into the shared cache. If the entire state doesn't fit, we need to be somewhat smarter about what we cache. Maybe re-execute the last X blocks and build some sort of filter based on that?

alexggh (Contributor) commented Jan 27, 2025

So, I looked into it. I have identified multiple issues, but first let's start with the benchmark. (You can replicate them by using the bkchr-cache-hack branch)

Thank you @bkchr, I can also reproduce your ~7x speedup with your branch.

This does not look very promising. A cached vs disk access should be a much bigger difference. Probably those benchmarks are not measuring the right thing. Have you looked into those benchmarks that Basti wrote for the trie cache?

Additionally, we also investigated why running with --trie-cache-size 1073741824 (#6131 (comment)) seemed to have a low impact. Our tests show that when you run polkadot-parachain benchmark storage, the default RocksDB cache (--db-cache) is set to 1GiB, hence the very small impact: you are not hitting the disk most of the time. Setting the trie cache even higher, to 10GiB, gets you something like this for reads on my machine with westend-asset-hub state.

| Setup | Read time (ns) |
| --- | --- |
| No trie-cache, default RocksDB cache 1GiB | Average: 25003, Median: 24394, Stddev: 11272.91 |
| Trie-cache 1GiB, default RocksDB cache 1GiB | Average: 11606, Median: 13678, Stddev: 22274.85 |
| No trie-cache, RocksDB cache set to 128MiB | Average: 183478, Median: 160212, Stddev: 84591.13 |
| Trie-cache 1GiB, RocksDB cache 128MiB | Average: 25120, Median: 15921, Stddev: 37264.04 |
| Trie-cache 10GiB, default RocksDB cache 1GiB | Average: 739, Median: 711, Stddev: 547.62 |

Takeaways

  • Trie-cache does have a big impact when set properly (+ some fixes discovered by @bkchr in #6131 (comment)), but that impact is sometimes reduced because the RocksDB cache already does its job.
  • The read/write weights that we currently use seem to already take into account the presence of some in-memory cache (the RocksDB cache); without it, the results are a lot worse on my machine.

bkchr (Member) commented Jan 27, 2025

the default RocksDB cache (--db-cache) is set to 1GiB, hence the very small impact: you are not hitting the disk most of the time,

Always traversing the trie + decoding every node on every access costs around 2-3x in performance (if I remember correctly). Given that the cache is not working correctly, that is more likely the explanation for why you don't see much impact in the benchmark.

bkchr (Member) commented Jan 27, 2025

Another thing that just came to my mind: when the cache is not big enough to hold the entire state, nodes that we touched earlier are more likely to be removed. Yeah, not that special with an LRU cache, but if we for example remove the lower layers of the trie, the performance impact could be bigger. Basically what I want to say is: the weighting on what should be removed from the LRU cache should be done based on the depth of the node, because the lower the depth, the higher the likelihood it will be used again by any other query. Like the storage root, which will always be required when going down the trie.

bkchr (Member) commented Jan 27, 2025

the weighting on what should be removed from the LRU cache should be done based on the depth of the node

I wanted to write "should be done based on the depth of the node combined with the LRU", because at some point we want to remove those nodes that are not accessed that often.
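
A rough sketch of such a combined eviction score, assuming we track a recency counter and the trie depth per cache entry (purely illustrative, not the current LRU limiter code):

```rust
/// Hypothetical cache-entry metadata: recency from the LRU plus trie depth.
struct EntryMeta {
    /// Larger means accessed longer ago (e.g. global counter minus last access).
    age: u64,
    /// Distance from the root; 0 is the state root node itself.
    depth: u32,
}

/// Combined eviction score: evict the highest score first. Old and deep entries
/// go first; shallow nodes (the root and its children) are kept the longest.
fn eviction_score(meta: &EntryMeta) -> u64 {
    meta.age.saturating_mul(meta.depth as u64 + 1)
}
```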

bkchr (Member) commented Jan 29, 2025

@alexggh and I discussed this a little bit further in DM. He looked into the inline cache size. Each OwnedNode requires 750 bytes, which leads to a memory usage of 3GB, which I observed. After that he did an experiment to check what the real memory usage is, by only counting the child node hashes that are actually set. This shows a potential memory reduction of 4x. The idea is now to put the hashes onto the heap, which will reduce the inline size. The hashes should be allocated from some sort of arena allocator, to not kill performance by allocating that many small hashes.
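
A simplified sketch of the layout change being discussed, with made-up types (the real OwnedNode is more involved):

```rust
/// Simplified current shape: every branch node reserves 16 inline child slots,
/// whether or not the children exist, which dominates the per-node footprint.
struct InlineBranch {
    children: [Option<[u8; 32]>; 16],
}

/// Simplified proposed shape: keep only the hashes that are actually set.
/// A real implementation would allocate them from an arena/bump allocator
/// instead of a per-node Vec, to keep the many small allocations cheap.
struct CompactBranch {
    /// Bit i set means child i exists and its hash is stored in `children`.
    occupied: u16,
    children: Vec<[u8; 32]>,
}

fn main() {
    println!(
        "inline: {} bytes vs compact header: {} bytes (+ 32 per set child)",
        std::mem::size_of::<InlineBranch>(),
        std::mem::size_of::<CompactBranch>(),
    );
}
```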

alexggh (Contributor) commented Jan 29, 2025

Alright, I took a closer look and I want to make sure we are going in the right direction here, so let me recap what I've understood: the end goal of this ticket, the current state of the caching, open questions from my side, potentially related issues, and what's next.

End goal

@athei stated it pretty clearly here: #6131 (comment) "Let me state the objective of that issue a bit more high level: We want to put the whole (AssetHub) state in RAM and have the weights reflect that. Completely eradicating disk access from the equation. Having a state-sized trie cache and warming it up is the assumed best short-term solution for that"

Current state of caching

As of now, besides the OS caching, the Polkadot node has two types of caching: the RocksDB database cache, which works at the db level and by default is configured to 1GiB, and the TrieCache (benefits explained in the link), which sits on top of it.

The current TrieCache is implemented with two components, the SharedTrieCache and the LocalTrieCache, each having separate in-memory LRU maps for nodes and values.

The size of the SharedTrieCache is configurable with --trie-cache-size, which gets equally split between Nodes and Values. On the other hand the LocalTrieCache size is hardcoded to ~8.5MiB for nodes and ~2.5MiB for values.

Read path

Polkadot nodes always try to find the node in the LocalTrieCache, then they try the SharedTrieCache, and at the end they go to the RocksDB database with its cache. When the value is retrieved, the LocalTrieCache gets populated with the new node, but not the shared cache.

Nodes and values get promoted from the LocalTrieCache to the SharedTrieCache only when the LocalTrieCache gets dropped, and even on that path the promotion is limited to only 1792 entries of each type.
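
A condensed sketch of that lookup order (types and method names are illustrative, not the sp_trie API):

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// Illustrative stand-ins for the three layers described above.
struct LocalCache(HashMap<Key, Value>);
struct SharedCache(HashMap<Key, Value>);
struct Database(HashMap<Key, Value>);

/// Lookup order on reads: local cache, then shared cache, then the database.
/// Only the local cache is populated on a miss; promotion to the shared cache
/// only happens later, when the local cache is dropped.
fn read(local: &mut LocalCache, shared: &SharedCache, db: &Database, key: &Key) -> Option<Value> {
    if let Some(v) = local.0.get(key) {
        return Some(v.clone());
    }
    let found = shared.0.get(key).or_else(|| db.0.get(key)).cloned();
    if let Some(v) = &found {
        // Populate the local cache only; the shared cache is untouched here.
        local.0.insert(key.clone(), v.clone());
    }
    found
}
```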

Write path

The writes seem to be write-through: we write both to the main RocksDB database and to the LocalTrieCache (https://github.com/paritytech/trie/blob/c3348e4b335443c17586a8b7491ba893cfde76a6/trie-db/src/triedbmut.rs#L1841), so the TrieCache wouldn't give us any speed-up on this path.

As for reads, nodes and values get promoted to the SharedTrieCache only when the LocalTrieCache gets dropped, and even on that path the promotion is limited to only 1792 entries of each type.

Related issues

A cached vs disk access should be a much bigger difference.

Indeed, I was puzzled by the same thing in our Spammening runs.

During the Kusama spammening, some overweight relay-chain blocks were created (https://github.com/ordian/replay-kusama-25876038?tab=readme-ov-file#benchmark-results-on-the-reference-hardware), and the trie-cache does not seem to impact the execution time in any way: https://hackmd.io/@Ww6uNnIISmqGwXufqPYPOw/S1IIq31uyx. Although it might somehow be related to the cache performance, I think this should be treated as a different issue.

Open questions

  • Is the write-through method good enough? We need to look deeper into how the RocksDB cache affects this.

  • Assuming the end goal is reached and we hold the entire state in memory, we will get a CPU speed-up and that will allow us to pack more transactions into a parachain block. However, there might be other bottlenecks that prevent us from doing so; PoV size is the first to come to mind. I'm aware there are ways to get more here by increasing the max PoV size to 10MiB (Set PoV size limit to 10 Mb #5884) or using more cores with elastic scaling, but there might be other unseen issues. So do we have a high-level use-case/test that we can use as a north star to confirm we are achieving the expected results?

  • Are the Kusama overweight blocks caused by a buggy/overweight cache?

  • What about ParityDB? It seems to me that we are still not using it. Does anyone think it would help on this path? Should we maybe consider running with it?

  • Timeline: I see this ticket is marked as needed by 1 May 2025, is that accurate?

Next

bkchr (Member) commented Jan 29, 2025

https://hackmd.io/@Ww6uNnIISmqGwXufqPYPOw/S1IIq31uyx, although it might be related somehow with the cache performance I think this should be treated as a different issue.

Would be interesting to see the cache hits with trie-cache=debug. One thing that the state_access and benchmark storage do not measure is the impact of the runtime => host overhead when fetching the values from the runtime. It could be that the overhead is so big that any optimizations done by the cache are not measurable.

alexggh (Contributor) commented Feb 3, 2025

Is the write-through method good enough? We need to look deeper into how the RocksDB cache affects this.

I've spent some time investigating this. Looking into the write benchmark, we can see that it is split into two parts:

Running the write benchmarks (with asset-hub-westend state) on my machine gives me the following result: Average: 58 us, which splits into computing the storage root (P1) ~43 us and committing the changes to the RocksDB database (P2) ~15 us. Note that writing the changes to RocksDB is not the bottleneck here; the RocksDB write-ahead log seems to be doing its job.

Computing the storage root could massively benefit from using a warmed TrieCache, so I decided to enable it here and re-ran the same benchmark. The result is Average: 36 us, with computing the storage root now taking Average: 21 us while the commit to the database remains unchanged at around ~15 us. On this run the storage-root computation (P1) runs with everything in memory, with most of the time spent on encoding the nodes and inserting them into the PrefixedMemoryDB.
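
For illustration, the two phases can be timed separately along these lines (the helper functions are placeholders, not the real backend calls):

```rust
use std::time::Instant;

/// Illustrative split of one write-benchmark iteration into the two phases
/// discussed above. `compute_storage_root` and `commit_to_db` are placeholders
/// for the real backend calls, not actual polkadot-sdk functions.
fn measure_write(changes: &[(Vec<u8>, Vec<u8>)]) -> (u128, u128) {
    let t0 = Instant::now();
    let _root = compute_storage_root(changes); // P1: trie traversal + hashing
    let p1 = t0.elapsed().as_nanos();

    let t1 = Instant::now();
    commit_to_db(changes); // P2: write the produced transaction to RocksDB
    let p2 = t1.elapsed().as_nanos();

    (p1, p2)
}

fn compute_storage_root(_changes: &[(Vec<u8>, Vec<u8>)]) -> [u8; 32] { [0u8; 32] }
fn commit_to_db(_changes: &[(Vec<u8>, Vec<u8>)]) {}
```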

My takeaway

From this data, I don't think we would gain other significant improvements by trying to remove the writes to RocksDB from the hot path, because the time it takes to perform a trie key write seems to actually be bounded by the computation of the storage root.

@bkchr @athei @ggwpez Thoughts ?

athei (Member, Author) commented Feb 4, 2025

Okay, let me phrase the objective at an even higher level: the costs associated with reads and writes (as expressed by the runtime weights) are way too high. With everything in RAM they should not be in the double-digit microseconds. This is why I assumed disk access was the issue. If something else is the issue, then that needs to be fixed.

Is the write-through method good enough? We need to look deeper into how the RocksDB cache affects this.

Costs are still too high. It seems that rebuilding the trie root on every write is the issue here, not writing to RocksDB. So this has to be addressed. The write to RocksDB (while the smaller portion) still takes double-digit microseconds. A write-back solution would be preferable. We have the RAM.

I don't see numbers on how the trie cache improved read performance. Only write performance. What was the effect there?

Are the Kusama overweight blocks caused by a buggy/overweight cache?

I guess we don't know. Weights could be underestimated in any place. Doesn't have to be related to storage access. Since the cache size didn't affect how long the blocks took to import the issue might be somewhere else.

Assuming the end goal is reached and we hold the entire state in memory, we will get a CPU speed-up and that will allow us to pack more transactions into a parachain block. However, there might be other bottlenecks that prevent us from doing so; PoV size is the first to come to mind. I'm aware there are ways to get more here by increasing the max PoV size to 10MiB (Set PoV size limit to 10 Mb #5884) or using more cores with elastic scaling, but there might be other unseen issues. So do we have a high-level use-case/test that we can use as a north star to confirm we are achieving the expected results?

The use case is contract execution. Contracts have to access 5 different storage items to even start executing. That is also true for every sub-call they do. Because the ref time overhead per read is so high, we are getting bottlenecked by this. Contracts also access a lot of small storage items due to the fact that they can only store 256 bits (one EVM word) per item.

What about ParityDB? It seems to me that we are still not using it. Does anyone think it would help on this path? Should we maybe consider running with it?

It doesn't meaningfully improve performance. Read accesses are cached, write accesses seem to be bottlenecked by the CPU, and the slowness of RocksDB writes can hopefully be fixed by some config changes (write-back).

Timeline: I see this ticket is marked as needed by 1 May 2025, is that accurate?

Yes. We aim for that cutoff so that we can have one Kusama release with the new weights before deploying to Polkadot.

alexggh (Contributor) commented Feb 4, 2025

I don't see numbers on how the trie cache improved read performance. Only write performance. What was the effect there?

On my machine (which is 1.5x faster than the reference machine), built with the production profile, the average read is Average: 808 ns with westend-asset-hub state.

athei (Member, Author) commented Feb 4, 2025

The interesting number would be with/without trie cache on the same machine. Getting the read costs down a lot would already be a big win. The writes seem to be complicated for the reasons you point out. But reads should get sped up by much more than 1.5x if they were hitting the disk before.

alexggh (Contributor) commented Feb 4, 2025

The interesting number would be with/without trie cache on the same machine. Getting the read costs down a lot would already be a big win. The writes seem to be complicated for the reasons you point out. But reads should get sped up by much more than 1.5x if they were hitting the disk before.

Sorry, I misled you with the 1.5x. I was saying that the machine is more powerful than the normal reference machine we use for generating the weights, so the absolute numbers wouldn't be the same, but I expect the speed-up to correlate.

Regarding runs with and without the trie cache, the results look like this:

Without trie cache:

Average: 25517, Median: 24647, Stddev: 17128.59

With trie cache, with the hacks/fixes from here

Average: 863, Median: 846, Stddev: 537.93

athei (Member, Author) commented Feb 4, 2025

Sorry, I probably didn't look into each post properly. That's a 30x speedup?! Units are nanoseconds? If we can get that reflected in the read weights, that would be great.

So rolling this out in two steps seems like a good idea: the quick read win first and then the writes later.

alexggh (Contributor) commented Feb 5, 2025

That's a 30x speedup?! Units are nanoseconds?

Yes and yes.

alexggh (Contributor) commented Feb 5, 2025

the write benchmark we can see that this is split into two parts:

Looking a bit closer at this write benchmark, I don't think it measures the cost of writes correctly, because as far as I can tell, when we import or create a block, the storage root computation and the application of changes to RocksDB happen once at the end of the block.
So the benchmark factors in a lot of operations for which we would pay the price only once. I modified the benchmark to be able to batch more than one write operation, and you get an additional speed-up for the average write cost of about 2-3x. The results look like this:

| Setup | Batch size in modified keys | Average per key write (ns) | Average per batch (ms) |
| --- | --- | --- | --- |
| Without trie cache | 1 | 61300 | 0.0613 |
| Without trie cache | 100_000 | 18894 | 1889 |
| With trie cache | 1 | 34233 | 0.034 |
| With trie cache | 10_000 | 14532 | 145.324913 |
| With trie cache | 100_000 | 12511 | 1251 |
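
One way to read these numbers: the fixed end-of-block work (root recalculation + commit) gets amortized over the batch, roughly as in this sketch (all numbers in main are made up):

```rust
/// Amortized per-key write cost for a batch: the fixed end-of-block overhead is
/// paid once, the per-key work n times.
fn per_key_cost_ns(fixed_overhead_ns: u64, per_key_ns: u64, batch_size: u64) -> u64 {
    per_key_ns + fixed_overhead_ns / batch_size.max(1)
}

fn main() {
    // As the batch size grows, the result converges towards the pure per-key cost.
    for n in [1u64, 100, 10_000, 100_000] {
        println!("batch {n}: ~{} ns/key", per_key_cost_ns(20_000, 12_000, n));
    }
}
```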

athei (Member, Author) commented Feb 5, 2025

This is good news. Recalculating the trie root and flushing to RocksDB on every write seemed to be nonsensical to me. This fits much more with my mental model now.

The speed-up with the trie cache seems to be much lower than for reads. I guess this makes sense, since writes never had to hit the disk for every access even without the trie cache. But even with batching they still cost much more than reads (12us vs 800ns).

This all seems to click into place now. Still some open questions:

  1. Why are writes so much more expensive if the db write and root calculation are only done once? I assume because the root calculation and db write still scale with the number of changed keys? How much of that is root calculation (CPU) vs db write? The latter we can speed up by tweaking the RocksDB write to be write-back + log (making it async). The former we can probably not do too much about, except optimizing the code a bit.

  2. If the trie cache actually works well and the weights are overestimating the costs (especially for reads) why are the blocks still overweight with the trie cache enabled? Because this test was conducted without the fixes to the trie cache?

But my takeaway is this: Our weights are vastly overestimating the actual costs: 20us for reads and 100us for writes. The former because we benchmarked without a warmed up cache. The latter because the benchmark was not taking into account that much of it is a one time cost.

alexggh (Contributor) commented Feb 5, 2025

Why are writes so much more expensive if the db write and root calculation are only done once? I assume because the root calculation and db write still scale with the number of changed keys?

Yes, the more keys you change the more expensive the root calculation and committing are.

How much of that is root calculation (CPU) vs db write? The latter we can speed up by tweaking the RocksDB write to be write-back + log (making it async). The former we can probably not do too much about, except optimizing the code a bit.

It is roughly 60% spent on computing the storage root and 40% committing the changes.

If the trie cache actually works well and the weights are overestimating the costs (especially for reads) why are the blocks still overweight with the trie cache enabled? Because this test was conducted without the fixes to the trie cache?

Yeah, that's what worries me as well, although there are several other reasons that might explain it, like #6131 (comment) or maybe we wrongly computed the weights for the involved extrinsics.

Also, these basic operation costs highly depend on the actual state you are using (see), so maybe the Kusama relay chain state looks different. That's one of the reasons I was asking for a high-level use-case/test that we can use to make sure these micro-benchmarks actually correlate when you put them on our full stack with the smart contracts use case.

athei (Member, Author) commented Feb 5, 2025

I suggest then conducting the overweight block test/benchmark with a block filled with pallet_revive::call extrinsics. I suggest using ERC20 transfers: instantiate one contract and then fill the block with ERC20 transfers using this contract, but each for different accounts.

Conduct this check with a warmed up trie cache.

Then check if the blocks are taking too long with the current weights (being overweight). If they are already overweight then we have a problem. But I suspect the blocks to be underweight because there is a lot of storage access but not a lot of computation in these transfers and we still charge accesses very highly.

It is important that this uses the same contract and not different ones, because otherwise you probably hit the PoV limit first due to the amount of code that has to be included in the PoV.

ggwpez (Member) commented Feb 6, 2025

So the benchmark factors in a lot of operations for which we would pay the price only once. I modified the benchmark to be able to batch more than one write operation, and you get an additional speed-up for the average write cost of about 2-3x. The results look like this:

Yea this is reasonable. Maybe add a CLI flag to configure this. Not sure what a good default would be, maybe 100 or 1000?
Depends on how full a block would be, right? The fuller they are the more speed up we get from this batching.

alexggh (Contributor) commented Feb 6, 2025

Not sure what a good default would be, maybe 100 or 1000? Depends on how full a block would be, right? The fuller they are the more speed up we get from this batching.

I've been asking myself this as well, and I think it should be higher, 10_000 or even 100_000, because 10_000 eats in total just 145ms and 100_000 around 1.25 secs. These weights are important when you have a lot of writes in the block; if you have just a few, it doesn't matter much because you will always be within the time bounds.

athei (Member, Author) commented Feb 6, 2025

The real weight is what you would get when the batch size goes to infinity. So basically, as n runs towards infinity, the weight result will converge towards what we want. Because the constant overhead is already included in the base extrinsic weight, you want the actual contribution per key here.

But there are practical problems with that. The simple idea is to just use a big enough n until the result doesn't change much anymore.

But I think just splitting this into two benchmarks would make it simpler.

  1. Benchmark writes without the end-of-block trie calculation and write-back, over n (number of keys).
  2. Benchmark the end-of-block overhead over n (number of keys).

Then you would get better results even for small n.
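
In cost-model terms (notation mine), with c_block the one-time end-of-block overhead and c_key the true per-key contribution:

$$
T(n) = c_{\text{block}} + n \, c_{\text{key}}
\qquad\Rightarrow\qquad
\frac{T(n)}{n} = c_{\text{key}} + \frac{c_{\text{block}}}{n} \xrightarrow[n \to \infty]{} c_{\text{key}}
$$

Benchmark 1 then estimates c_key directly and benchmark 2 estimates c_block, so neither needs a very large n.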

alexggh (Contributor) commented Feb 6, 2025

I think it became a bit hard to follow all the threads in this issue, so I decided to summarise and provide a breakdown here https://hackmd.io/@o1WwXXhpRRmk87EoSBRzqA/Hk6D4NfFJg, let me know if you think anything is missing or if something is not clear.

Once all is good, I'll move them here as GitHub sub-issues.
