
Implements concurrent Smt::compute_mutations #365

Merged: 15 commits into 0xPolygonMiden:next on Feb 7, 2025

Conversation

krushimir

This PR introduces a concurrent implementation of Smt::compute_mutations, leveraging an approach similar to the existing parallel construction logic.

Benchmark results were collected on a 64-core (128-thread) AMD EPYC 7662 processor, with Rayon’s thread pool explicitly limited to the specified thread counts.

For context, construction benchmarks are also included for performance comparison.
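For readers reproducing these numbers: the benchmarks pin the worker count via Rayon's thread pool. As a dependency-free illustration of the same idea, here is a std-only sketch that splits a workload across a fixed number of threads; the function name and chunking strategy are illustrative, not the benchmark code, which uses Rayon:

```rust
use std::thread;

/// Apply `f` to every item, splitting the work across `num_threads` workers.
/// A std-only stand-in for what a fixed-size Rayon thread pool does.
fn parallel_map<T: Sync, R: Send>(
    items: &[T],
    num_threads: usize,
    f: impl Fn(&T) -> R + Sync,
) -> Vec<R> {
    // One contiguous chunk per worker thread (at least one item per chunk).
    let chunk = items.len().div_ceil(num_threads).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(f).collect::<Vec<R>>()))
            .collect();
        // Join in spawn order so output order matches input order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // Toy workload: a cheap transform over 10k items on 4 threads.
    let items: Vec<u64> = (0..10_000).collect();
    let out = parallel_map(&items, 4, |x| x * 3);
    assert_eq!(out.len(), 10_000);
    assert_eq!(out[10], 30);
}
```

Rayon dynamically work-steals between threads rather than pre-chunking like this, which is part of why its scaling behavior differs across thread counts.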

1. Construction Benchmark

10k key-value pairs

| Threads | Parallel Time (s) | Sequential Time (s) | Speedup |
|---------|-------------------|---------------------|---------|
| 16      | 0.5               | 5.7                 | 11.11x  |
| 32      | 0.4               | 5.7                 | 15.22x  |
| 64      | 0.3               | 5.7                 | 17.35x  |
| 128     | 0.4               | 5.7                 | 16.90x  |
  • Optimal performance was achieved with 64 threads.
  • Diminishing returns were observed with 128 threads.

2. Batched Insertion Benchmark

10k key-value pairs

| Threads | Parallel Time (ms) | Sequential Time (ms) | Speedup | Avg Insert Time (μs) |
|---------|--------------------|----------------------|---------|----------------------|
| 16      | 517.0              | 6308.7               | 12.20x  | 52                   |
| 32      | 395.8              | 6334.5               | 16.00x  | 40                   |
| 64      | 333.0              | 6321.6               | 18.98x  | 33                   |
| 128     | 383.7              | 6300.7               | 16.42x  | 38                   |
  • 64 threads offered the best performance, reducing average insertion time to 33 μs.
  • Scaling beyond 64 threads led to slight performance degradation.

3. Batched Update Benchmark

10k key-value pairs

| Threads | Parallel Time (ms) | Sequential Time (ms) | Speedup | Avg Update Time (μs) |
|---------|--------------------|----------------------|---------|----------------------|
| 16      | 482.7              | 6369.8               | 13.20x  | 48                   |
| 32      | 357.7              | 6351.5               | 17.76x  | 36                   |
| 64      | 304.7              | 6378.5               | 20.93x  | 30                   |
| 128     | 273.5              | 6418.8               | 23.47x  | 27                   |
  • Batched updates scaled better with increased threads.
  • 128 threads achieved the fastest update speed, reducing average time to 27 μs.

Contributor

@PhilippGackstatter PhilippGackstatter left a comment

Looks great to me! I think the logic itself looks good. My comments are mostly about naming, docs and deduplication. I might have to take another look anyway, since I first had to understand how the Smt is implemented in sequential code 😅, so I'll just comment for now.

In general, I think adding comments to code parts that are not easy to understand would improve readability and understandability.

Regarding the approach, please correct me if I have misunderstandings, but my understanding of the approach is the following.

Assuming a tree of depth 64 with subtrees of depth 8 and mutations to just two leaves (for example's sake) at indices 0 and 65536, compute_mutations would do this, at a high level and making some simple assumptions about how rayon assigns threads:

  1. Compute subtrees that were modified. This happens in sorted_pairs_to_mutated_leaves. This would yield two subtrees, covering the column ranges 0..256 and 65536..65792.
  2. Then in build_subtree_mutations, the subtrees are updated in parallel.
    • 1st iteration:
      • Thread 0: Compute updates for leaves with indices 0..256 at depth 64. Then updates for leaves at depth 63 within this subtree, and so on, until it eventually results in new root at depth 56, column 0.
      • Thread 1: Compute updates for leaves with indices 65536..65792 at depth 64. Then updates for leaves at depth 63 within this subtree, and so on, until it eventually results in new root at depth 56, column 256 (= 65536 >> 8).
    • 2nd iteration:
      • Thread 0: Compute updates for leaves with indices 0..256 at depth 56 (only root 0 has changed). Eventually this results in a new root at depth 48, column 0.
      • Thread 1: Compute updates for leaves with indices 256..512 at depth 56 (only root 256 has changed). Eventually this results in a new root at depth 48, column 1.
    • 3rd iteration:
      • Thread 0: Compute updates for leaves with indices 0..256 at depth 48 (only root 0 has changed). Eventually this results in a new root at depth 40, column 0.
    • More iterations like the 3rd until the root at depth 0 has been reached.

Is this accurate? Would it make sense to add something like this as a doc comment to compute_mutations_subtree (with corrections if it's inaccurate)?
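If it helps, the (depth, column) bookkeeping in the walkthrough above can be sketched in plain Rust. The names and the constant are illustrative (mirroring the depth-8 subtrees assumed above), not the crate's API:

```rust
/// Depth of each subtree processed per iteration (illustrative constant,
/// matching the depth-8 subtrees assumed in the walkthrough above).
const SUBTREE_DEPTH: u8 = 8;

/// For a leaf at `column` on `start_depth`, list the (depth, column) of the
/// new subtree root produced by each iteration, down to the tree root.
fn root_path(mut column: u64, start_depth: u8) -> Vec<(u8, u64)> {
    let mut roots = Vec::new();
    let mut depth = start_depth;
    while depth > 0 {
        depth -= SUBTREE_DEPTH;
        column >>= SUBTREE_DEPTH;
        roots.push((depth, column));
    }
    roots
}

fn main() {
    // Leaf at index 65536, depth 64: the 1st iteration yields a new root at
    // depth 56, column 256 (= 65536 >> 8), the 2nd at depth 48, column 1, etc.
    let path = root_path(65536, 64);
    assert_eq!(path[0], (56, 256));
    assert_eq!(path[1], (48, 1));
    assert_eq!(path[2], (40, 0));
    // From depth 40 on, the changed root stays in column 0 until the tree root.
    assert_eq!(path.last(), Some(&(0, 0)));
}
```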

@krushimir
Author

10M-entry tree.

Batch insertions (10k inserts):
  • without smt_hashmaps: 383.3 ms (~38 μs per insert)
  • with smt_hashmaps: 281.9 ms (~28 μs per insert)
  • ~26% faster with smt_hashmaps
  • concurrent vs. sequential: 17.7x faster
  • concurrent with smt_hashmaps vs. sequential: 24.1x faster

Batch updates (10k updates):
  • without smt_hashmaps: 287.9 ms (~29 μs per update)
  • with smt_hashmaps: 265.5 ms (~27 μs per update)
  • ~8% faster with smt_hashmaps
  • concurrent vs. sequential: 23.6x faster
  • concurrent with smt_hashmaps vs. sequential: 25.6x faster

@PhilippGackstatter
Contributor

Hey @krushimir, quick question: Is this still Work-In-Progress or can it be marked as ready for review?

@krushimir
Author

Hi @PhilippGackstatter, I'll push some more changes today and then I'll mark it ready.

@krushimir krushimir marked this pull request as ready for review January 23, 2025 07:12
@krushimir krushimir changed the title [WIP] implements concurrent Smt::compute_mutations Implements concurrent Smt::compute_mutations Jan 23, 2025
Contributor

@PhilippGackstatter PhilippGackstatter left a comment

Looks good to me!

krushimir and others added 2 commits January 23, 2025 17:32
Co-authored-by: Philipp Gackstatter <PhilippGackstatter@users.noreply.github.com>
@krushimir krushimir force-pushed the krushimir/subtree_mutations branch from 9242cff to e89daa9 on January 23, 2025 17:03
Contributor

@bobbinth bobbinth left a comment

Looks good! Thank you! I left a couple of comments inline. The main one is about code organization - i.e., potentially moving the parallel mutation functions to the Smt struct.


sonarqubecloud bot commented Feb 6, 2025

Contributor

@bobbinth bobbinth left a comment

Looks good! Thank you!

@bobbinth bobbinth merged commit 1b77fa8 into 0xPolygonMiden:next Feb 7, 2025
15 checks passed
@bobbinth
Contributor

bobbinth commented Feb 7, 2025

On my machine (M1 Pro), I see the following results:

Single-threaded execution, smt_hashmaps enabled

Running a construction benchmark:
Constructed an SMT with 1000000 key-value pairs in 351.2 seconds
Number of leaf nodes: 1000000

Running an insertion benchmark:
The average insertion time measured by 1000 inserts into an SMT with 1000000 leaves is 694 μs

Running a batched insertion benchmark:
The average insert-batch computation time measured by a 1000-batch into an SMT with 1000000 leaves over 424.6 ms is 425 μs
The average insert-batch application time measured by a 1000-batch into an SMT with 1000000 leaves over 199.3 ms is 199 μs
The average batch insertion time measured by a 1000-batch into an SMT with 1000000 leaves totals to 623.9 ms

Running a batched update benchmark:
The average update-batch computation time measured by a 1000-batch into an SMT with 1000000 leaves over 414.2 ms is 414 μs
The average update-batch application time measured by a 1000-batch into an SMT with 1000000 leaves over 4.6 ms is 5 μs
The average batch update time measured by a 1000-batch into an SMT with 1000000 leaves totals to 418.8 ms

Running a proof generation benchmark:
The average proving time measured by 100 value proofs in an SMT with 1000000 leaves in 0 μs

Multi-threaded execution, smt_hashmaps enabled

Running a construction benchmark:
Constructed an SMT with 1000000 key-value pairs in 37.2 seconds
Number of leaf nodes: 1000000

Running an insertion benchmark:
The average insertion time measured by 1000 inserts into an SMT with 1000000 leaves is 610 μs

Running a batched insertion benchmark:
The average insert-batch computation time measured by a 1000-batch into an SMT with 1000000 leaves over 50.1 ms is 50 μs
The average insert-batch application time measured by a 1000-batch into an SMT with 1000000 leaves over 36.4 ms is 36 μs
The average batch insertion time measured by a 1000-batch into an SMT with 1000000 leaves totals to 86.5 ms

Running a batched update benchmark:
The average update-batch computation time measured by a 1000-batch into an SMT with 1000000 leaves over 51.1 ms is 51 μs
The average update-batch application time measured by a 1000-batch into an SMT with 1000000 leaves over 5.1 ms is 5 μs
The average batch update time measured by a 1000-batch into an SMT with 1000000 leaves totals to 56.2 ms

Running a proof generation benchmark:
The average proving time measured by 100 value proofs in an SMT with 1000000 leaves in 0 μs

Comparison

| Benchmark                | Single-threaded | Multi-threaded | Improvement |
|--------------------------|-----------------|----------------|-------------|
| Construction (1M leaves) | 351 sec         | 37 sec         | 9.5x        |
| Insertion (1K batch)     | 623 ms          | 86 ms          | 7.2x        |
| Updates (1K batch)       | 419 ms          | 56 ms          | 7.5x        |

@bobbinth
Contributor

bobbinth commented Feb 7, 2025

And on M4 max, the results look like so:

| Benchmark                | Single-threaded | Multi-threaded | Improvement |
|--------------------------|-----------------|----------------|-------------|
| Construction (1M leaves) | 195 sec         | 15 sec         | 13x         |
| Insertion (1K batch)     | 212 ms          | 28 ms          | 7.6x        |
| Updates (1K batch)       | 218 ms          | 24 ms          | 9.1x        |
