Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML-DSA: Optimize implementation #205

Closed
3 tasks done
marsella opened this issue Dec 10, 2024 · 3 comments · Fixed by #232
Closed
3 tasks done

ML-DSA: Optimize implementation #205

marsella opened this issue Dec 10, 2024 · 3 comments · Fixed by #232
Assignees
Labels
CNSA 2.0 improvement Addresses fixes or changes to existing specs

Comments

@marsella
Copy link
Contributor

marsella commented Dec 10, 2024

The version of ML-DSA that is currently being developed is very closely adherent to the spec. However, the spec is not written in a way that lends itself to fast Cryptol code. There are two particular types of slow-code that I've noticed:

  • direct indexing while operating over arrays; and
  • iterating + rebuilding arrays instead of using sequence-level operations (e.g. join, reverse, etc.)

In general, we don't want to delete the spec-adherent code, because spec adherence is a high-level goal of this repository. We can either make separate functions inline (e.g. BitsToBytes_fast) or make a separate module with the fast versions -- this might depend a little on the architecture decision we make in #198. In either case we need to prove equivalence between the spec version and the fast version.

Here are a few notes about things I've seen that could probably be faster:

  • IntegerToBits: we could use the built-in function fromInteger, with a reverse call to get the endianness right. Similarly with BitsToInteger and probably IntegerToBytes
  • BitsToBytes and BytesToBits: these should just be split and join calls.
  • All the BitPack functions (Alg 16 - 19) index into an array, but they could iterate directly over it instead.

  • Audit or benchmark the spec to identify areas for improvement (either based on Cryptol aesthetics or on some actual timing measurements). Put a list in this issue for future reference.
  • Decide where to put the "fast" versions.
  • Implement the optimized versions.
@marsella marsella added CNSA 2.0 improvement Addresses fixes or changes to existing specs labels Dec 10, 2024
@mariosge
Copy link
Contributor

mariosge commented Jan 28, 2025

I'm planning to implement the following optimizations:

  • IntegerToBits and BitsToInteger
  • IntegerToBytes (the inverse does not exist in the spec)
  • BitsToBytes and BytesToBits
  • BitPack and BitUnpack
  • SimpleBitPack and SimpleBitUnpack

As a first indicator, we are getting the speedups below. At a high level we are getting around 4-20x speedups.

Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BytesToBits`{10} zero
Measuring for 10 seconds
Avg time: 367.2 μs    Avg CPU time: 367.4 μs    Avg cycles: 8813
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BytesToBitsFast`{10} zero
Measuring for 10 seconds
Avg time: 22.28 μs    Avg CPU time: 22.29 μs    Avg cycles: 534
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitsToBytes`{10} zero
Measuring for 10 seconds
Avg time: 168.7 μs    Avg CPU time: 168.8 μs    Avg cycles: 4050
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitsToBytesFast`{10} zero
Measuring for 10 seconds
Avg time: 11.51 μs    Avg CPU time: 11.51 μs    Avg cycles: 276
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time IntegerToBytes`{10} zero
Measuring for 10 seconds
Avg time: 45.47 μs    Avg CPU time: 45.49 μs    Avg cycles: 1091
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time IntegerToBytesFast`{10} zero
Measuring for 10 seconds
Avg time: 5.264 μs    Avg CPU time: 5.266 μs    Avg cycles: 126
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitsToInteger`{10} zero
Measuring for 10 seconds
Avg time: 69.28 μs    Avg CPU time: 69.31 μs    Avg cycles: 1662
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitsToIntegerFast`{10} zero
Measuring for 10 seconds
Avg time: 3.719 μs    Avg CPU time: 3.721 μs    Avg cycles: 89
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time IntegerToBits`{10} zero
Measuring for 10 seconds
Avg time: 56.24 μs    Avg CPU time: 56.25 μs    Avg cycles: 1349
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time IntegerToBitsFast`{10} zero
Measuring for 10 seconds
Avg time: 3.727 μs    Avg CPU time: 3.727 μs    Avg cycles: 89
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> 
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> 
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitUnpack`{10, 10} zero
Measuring for 10 seconds
Avg time: 20.94 ms    Avg CPU time: 20.95 ms    Avg cycles: 502485
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitUnpackFast`{10, 10} zero
Measuring for 10 seconds
Avg time: 3.027 ms    Avg CPU time: 3.028 ms    Avg cycles: 72648
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitUnpack`{10} zero
Measuring for 10 seconds
Avg time: 17.39 ms    Avg CPU time: 17.39 ms    Avg cycles: 417271
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitUnpackFast`{10} zero
Measuring for 10 seconds
Avg time: 1.912 ms    Avg CPU time: 1.912 ms    Avg cycles: 45884
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitPack`{10} zero
Measuring for 10 seconds
Avg time: 26.45 ms    Avg CPU time: 26.47 ms    Avg cycles: 634915
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitPackFast`{10} zero
Measuring for 10 seconds
Avg time: 2.727 ms    Avg CPU time: 2.728 ms    Avg cycles: 65445
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitPack`{10, 10} zero
Measuring for 10 seconds
Avg time: 33.71 ms    Avg CPU time: 33.72 ms    Avg cycles: 808983
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time BitPackFast`{10, 10} zero
Measuring for 10 seconds
Avg time: 3.527 ms    Avg CPU time: 3.528 ms    Avg cycles: 84650
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> 
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> 
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> 
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time sigDecode zero
Measuring for 10 seconds
Avg time: 302.4 ms    Avg CPU time: 302.5 ms    Avg cycles: 7256612
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time sigDecodeFast zero
Measuring for 10 seconds
Avg time: 78.95 ms    Avg CPU time: 78.97 ms    Avg cycles: 1894814
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time w1Encode zero
Measuring for 10 seconds
Avg time: 158.2 ms    Avg CPU time: 158.2 ms    Avg cycles: 3795854
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time w1EncodeFast zero
Measuring for 10 seconds
Avg time: 13.40 ms    Avg CPU time: 13.41 ms    Avg cycles: 321677
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time sigEncode zero zero zero
Measuring for 10 seconds
Avg time: 443.3 ms    Avg CPU time: 443.5 ms    Avg cycles: 10638975
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time sigEncodeFast zero zero zero
Measuring for 10 seconds
Avg time: 55.62 ms    Avg CPU time: 55.65 ms    Avg cycles: 1334935
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time skDecode zero
Measuring for 10 seconds
Avg time: 330.0 ms    Avg CPU time: 330.0 ms    Avg cycles: 7919656
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time skDecodeFast zero
Measuring for 10 seconds
Avg time: 64.47 ms    Avg CPU time: 64.48 ms    Avg cycles: 1547283
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time skEncode zero zero zero zero zero zero
Measuring for 10 seconds
Avg time: 492.3 ms    Avg CPU time: 492.5 ms    Avg cycles: 11814406
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time skEncodeFast zero zero zero zero zero zero
Measuring for 10 seconds
Avg time: 54.96 ms    Avg CPU time: 54.96 ms    Avg cycles: 1319159
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time pkDecode zero
Measuring for 10 seconds
Avg time: 155.3 ms    Avg CPU time: 155.4 ms    Avg cycles: 3728245
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time pkDecodeFast zero
Measuring for 10 seconds
Avg time: 25.63 ms    Avg CPU time: 25.63 ms    Avg cycles: 615218
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time pkEncode zero zero
Measuring for 10 seconds
Avg time: 250.4 ms    Avg CPU time: 250.5 ms    Avg cycles: 6008699
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time pkEncodeFast zero zero
Measuring for 10 seconds
Avg time: 20.49 ms    Avg CPU time: 20.50 ms    Avg cycles: 491736
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitPack`{10} zero
Measuring for 10 seconds
Avg time: 27.09 ms    Avg CPU time: 27.10 ms    Avg cycles: 650237
Primitive::Asymmetric::Signature::ML_DSA::Instantiations::ML_DSA_44> :time SimpleBitPackFast`{10} zero
Measuring for 10 seconds
Avg time: 2.388 ms    Avg CPU time: 2.389 ms    Avg cycles: 57325

@mariosge mariosge linked a pull request Jan 28, 2025 that will close this issue
@marsella
Copy link
Contributor Author

marsella commented Jan 30, 2025

In the sample runtime you pasted, the encoding functions are only taking one parameter, but they should take more than that. Calling :time sigEncode zero partially applies the function but doesn't execute it -- we'd need :time sigEncode zero zero zero. I didn't check all of them but I'm seeing a small improvement. Sig encode has 4xBitPacks, though, so this is less that I anticipated.

> :time sigEncode_fast zero zero zero
Measuring for 10 seconds
Avg time: 505.6 ms    Avg CPU time: 505.8 ms    Avg cycles: 12135307
> :time sigEncode zero zero zero
Measuring for 10 seconds
Avg time: 507.5 ms    Avg CPU time: 507.9 ms    Avg cycles: 12179146

Edit: I think this is a one-off bug in sigEncode that I'm flagging in #232, the other ones I suspect are faster!

@mariosge
Copy link
Contributor

Indeed, thanks for pointing this out! I've updated the comment with the correct results. I see at least a 9x speedup on all encode functions now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CNSA 2.0 improvement Addresses fixes or changes to existing specs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants