Import ML-KEM from mlkem-native/PQ code package #2041

Open
wants to merge 9 commits into
base: main

@bhess bhess commented Jan 13, 2025

This PR tracks the integration of ML-KEM from the mlkem-native upstream repository.
It replaces the current ML-KEM implementation in liboqs, which was previously imported from pq-crystals, with the mlkem-native implementation from PQCP.

Some features of mlkem-native:

  • Portable C implementation (C90 compliant)
  • Optimized implementation for x86_64
  • Optimized implementation for ARM64
  • Formal verification

The upstream code recently had a v1.0.0-alpha release and is actively maintained. The goal is to synchronize the PR with an upcoming tagged release of mlkem-native.

Additionally, the upstream code includes enhanced key validation as defined by FIPS 203 by default, which resolves issue #1951.

Closes #1951.
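
For readers unfamiliar with the check behind #1951: the FIPS 203 encapsulation key check includes a modulus check, meaning every 12-bit coefficient packed into the key must be smaller than q = 3329. The sketch below illustrates only that check. It is not code from mlkem-native; the function name `ek_modulus_check` is made up for illustration, and it assumes the caller has already verified that the key is 384k + 32 bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MLKEM_Q 3329u

/* Hypothetical helper, not mlkem-native's API.
 * ek holds k polynomials of 256 coefficients packed at 12 bits each
 * (384 bytes per polynomial), followed by a 32-byte seed that is not checked.
 * Returns 0 if every coefficient lies in [0, q), -1 otherwise. */
int ek_modulus_check(const uint8_t *ek, size_t k)
{
    size_t i;
    unsigned bad = 0;

    assert(ek != NULL);
    for (i = 0; i < 384 * k; i += 3) {
        /* Every 3 bytes carry two 12-bit little-endian coefficients. */
        unsigned c0 = (unsigned)ek[i] | (((unsigned)ek[i + 1] & 0x0Fu) << 8);
        unsigned c1 = ((unsigned)ek[i + 1] >> 4) | ((unsigned)ek[i + 2] << 4);
        bad |= (unsigned)(c0 >= MLKEM_Q) | (unsigned)(c1 >= MLKEM_Q);
    }
    return bad ? -1 : 0;
}
```

For ML-KEM-768, k = 3 and the encapsulation key is 1184 bytes; the check runs over the first 1152 of them.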

TODOs:

  • Sync with the upcoming release version of mlkem-native
  • Update constant-time tests
  • Update documentation
  • Does this PR change the input/output behaviour of a cryptographic algorithm (i.e., does it change known answer test values)? (If so, a version bump will be required from x.y.z to x.(y+1).0.)
  • Does this PR change the list of algorithms available -- either adding, removing, or renaming? Does this PR otherwise change an API? (If so, PRs in fully supported downstream projects dependent on these, i.e., oqs-provider will also need to be ready for review and merge by the time this is merged.)

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
bhess added 4 commits January 21, 2025 15:10
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
@bhess bhess marked this pull request as ready for review January 21, 2025 16:24

@baentsch baentsch left a comment

Thanks for the PR, @bhess. I certainly didn't check all 540 files but focused on the integration logic; please see the individual comments. In general, the patch is way too large in my opinion: couldn't the upstream use fewer hard-coded include paths and also provide YML documentation of their implementation? Ideally, "copy_from_upstream" should be easy to run regularly to follow the upstream, without having to create new patches each time; patches only create unnecessary work for OQS and consequently reduce the motivation to keep the code up to date. Of course, if no further development is expected in PQCP (is that the case?), this point is moot.

docs/cbom.json (outdated, resolved)
scripts/copy_from_upstream/patches/mlkem-native.patch (outdated, resolved)
tests/constant_time/kem/passes/ml_kem (outdated, resolved)
bhess added 2 commits January 22, 2025 16:27
Copy-from-upstream option to preserve folder structure
Smaller patch: no include paths fixing & meta-ymls available upstream
Documenting ct-passes file
Update dependencies for CBOM
[full tests] [extended tests]

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
xof_x4_ctx statex;
unsigned int buflen;

+ shake128x4_inc_init(&statex);

Hmm -- is the upstream API for shake so much different that it doesn't need initialization? Or is this a real functional patch that should be upstreamed? Anyway, thanks for streamlining the patch overall @bhess!

@hanno-becker hanno-becker Jan 22, 2025

Yes, the upstream API for absorb is one-shot and includes initialization.

But still, @bhess could this be addressed in the fips202x4.h glue code instead? That might be a more natural place to bridge between one-shot and incremental API than a patch.

Member Author

Thanks @hanno-becker - I'll review that part.

@bhess bhess Jan 22, 2025

But still, @bhess could this be addressed in the fips202x4.h glue code instead? That might be a more natural place to bridge between one-shot and incremental API than a patch.

Our absorb function is also one-shot in the sense that it resets the context, performs absorb, and finalizes the context. However, we may want to reuse the context for another operation later.

The difference in our implementation compared to yours is that we allocate the structure on the heap. In this case, we explicitly perform this allocation once, before running absorb and squeeze.

Would it be possible to add this initialization step to the upstream code as well? For your implementation, it would effectively be a nop, while for us, it could handle the allocation without requiring a patch.
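
To make the request concrete, here is a minimal sketch of the lifecycle described above: an explicit init that allocates once, a one-shot absorb that resets the context so it can be reused, and a release. All names are hypothetical, and the mixing function is a toy placeholder rather than SHAKE; only the call pattern matters here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical context: state lives on the heap, as in the backend described above. */
typedef struct {
    uint64_t *state;
} xof_ctx;

/* Allocate once. A backend that keeps its state inline could make this a no-op. */
int xof_init(xof_ctx *ctx)
{
    ctx->state = malloc(sizeof *ctx->state);
    return ctx->state ? 0 : -1;
}

/* One-shot absorb: reset, absorb, finalize. Can be repeated on the same context. */
void xof_absorb_once(xof_ctx *ctx, const uint8_t *in, size_t inlen)
{
    size_t i;

    assert(ctx->state != NULL);              /* init must have succeeded */
    *ctx->state = 0xcbf29ce484222325ULL;     /* reset */
    for (i = 0; i < inlen; i++) {            /* toy absorb, NOT Keccak */
        *ctx->state = (*ctx->state ^ in[i]) * 0x100000001b3ULL;
    }
}

/* Toy squeeze: step the state and emit one byte per step. */
void xof_squeeze(xof_ctx *ctx, uint8_t *out, size_t outlen)
{
    size_t i;

    for (i = 0; i < outlen; i++) {
        *ctx->state = *ctx->state * 6364136223846793005ULL + 1442695040888963407ULL;
        out[i] = (uint8_t)(*ctx->state >> 56);
    }
}

/* Free the allocation made by xof_init. */
void xof_release(xof_ctx *ctx)
{
    free(ctx->state);
    ctx->state = NULL;
}
```

With an init hook like this in the upstream call sequence, the allocation would no longer need a downstream patch.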

@hanno-becker hanno-becker Jan 22, 2025

@bhess Can you elaborate a bit further why this cannot be done in the glue code? It is expected that the FIPS202 API used by mlkem-native may not exactly match the one of a consuming application bringing its own FIPS202, but in that case, the expectation is that fips202[x4].h contains shims mapping the (more specialized) mlkem-native calls to the (likely more general) underlying FIPS202 implementation.

We added examples/bring_your_own_fips202 which exemplifies this in the example of tiny_sha3. tiny_sha3's API is also not exactly what's needed for mlkem-native, and there's glue code mapping between them.

I'm not in principle opposed to making further changes, but would like to understand better why the API-difference cannot be handled like this in your case.
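
As a rough illustration of the kind of glue meant here (not the actual contents of fips202.h or of examples/bring_your_own_fips202), a one-shot call can be expressed on top of a more general incremental backend. The `inc_*` and `shim_*` names are hypothetical, and the backend is a toy stand-in, not Keccak.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical incremental backend, standing in for an application's own FIPS202. */
typedef struct {
    uint64_t s;
    int finalized;
} inc_ctx;

void inc_init(inc_ctx *c)
{
    c->s = 0xcbf29ce484222325ULL;
    c->finalized = 0;
}

void inc_absorb(inc_ctx *c, const uint8_t *in, size_t len)
{
    size_t i;

    assert(!c->finalized);                   /* absorbing after finalize is a caller bug */
    for (i = 0; i < len; i++) {              /* toy mixing, NOT Keccak */
        c->s = (c->s ^ in[i]) * 0x100000001b3ULL;
    }
}

void inc_finalize(inc_ctx *c)
{
    c->s ^= 0x1FULL;                         /* placeholder for padding */
    c->finalized = 1;
}

void inc_squeeze(inc_ctx *c, uint8_t *out, size_t len)
{
    size_t i;

    assert(c->finalized);
    for (i = 0; i < len; i++) {
        c->s = c->s * 6364136223846793005ULL + 1442695040888963407ULL;
        out[i] = (uint8_t)(c->s >> 56);
    }
}

/* Glue: the one-shot call the consumer expects, mapped onto the backend. */
void shim_absorb_once(inc_ctx *c, const uint8_t *in, size_t len)
{
    inc_init(c);
    inc_absorb(c, in, len);
    inc_finalize(c);
}
```

The shim gives the same result as driving the backend by hand with the input split across several absorb calls, so the backend stays usable in its general form elsewhere.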

@hanno-becker hanno-becker Jan 22, 2025

@SWilson4 Thank you for chiming in! How would a failure be raised, though? One would need to bubble it up the call chain to the top-level API, and change all function signatures along the way. For uniformity, one would also need to change all other places where hashing is used. Is that what you had in mind? I'm hesitant about such change on first thought, but will chat with Matthias. This is not specific to and not a requirement for this PR though, right? Is there already an issue / feature request?

@bhess As I understand, the only issue here is that at present a single shim is used for multiple implementations? I didn't expect that. With a shim merely for mlkem-native, you could just allocate in the shim for the shake128_absorb_once function. This would not affect the ability to do repeated absorb+squeeze in other contexts. Could you confirm that understanding?

That said, we already have an explicit xxx_release() function anyway, so it only makes the API more symmetric to also offer an init. I opened pq-code-package/mlkem-native#686. Update: This is now merged.

the only issue here is that at present a single shim is used for multiple implementations? I didn't expect that.

Can I please ask you to do this going forward then, @hanno-becker ? IMO this is exactly what OQS is and does at its core: A mechanism to bundle together all algorithms in the same way, both "from above" (for others to use) as well as "from below" (providing the same common code for all algorithms). Changing that (doing specific per-algorithm changes) voids the purpose of OQS or at least, renders it unmaintainable as the small team cannot possibly be specialists in all constituent algorithms "below" as well as support all kinds of user wishes "from above". Or phrased differently: It is absolutely clear that algorithm-specific integrations are much easier to maintain and more performant (as is being documented publicly right now by openssl and less publicly by the many proprietary PQC implementations springing up), but that's not what OQS is.

Member Author

That said, we already have an explicit xxx_release() function anyway, so it only makes the API more symmetric to also offer an init. I opened pq-code-package/mlkem-native#686. Update: This is now merged.

Great, thank you very much!

As an add-on to this, it would be really nice if there were an option in mlkem-native to error-check SHA3 / RNG code. Our strategy in liboqs right now is to do a hard exit on a malloc failure, and it would be great if we could instead check a return value and exit gracefully.

Fully agree that catching these errors gracefully would be desirable! However, I believe error handling with randombytes would be a concern in the future (#1750), as our current API doesn’t support this.

One example is the MAYO implementation, which supports a compile flag, HAVE_RANDOMBYTES_NORETVAL. This flag allows switching between randombytes implementations that either return a code or don’t. Perhaps adopting a similar approach could be beneficial for mlkem-native as well.

#if defined(PQM4) || defined(HAVE_RANDOMBYTES_NORETVAL)
    randombytes(tmp + param_digest_bytes, param_salt_bytes);
#else
    if (randombytes(tmp + param_digest_bytes, param_salt_bytes) != MAYO_OK) {
        ret = MAYO_ERR;
        goto err;
    }
#endif
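
One way such a switch could be kept out of the call sites is a small adapter that always yields a status code. The following is only a sketch: every `demo_`/`DEMO_` name is hypothetical, the stub RNG is deterministic and not random, and nothing here is taken from MAYO or mlkem-native beyond the flag name quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OK 0
#define DEMO_ERR (-1)

/* Test hook for the stub RNG below; for illustration only. */
int demo_rng_fail = 0;

#if defined(HAVE_RANDOMBYTES_NORETVAL)
/* RNG without a return value: failures cannot be reported to the caller. */
void demo_randombytes(uint8_t *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) buf[i] = (uint8_t)i;   /* NOT random */
}
#define DEMO_RNG(buf, n) (demo_randombytes((buf), (n)), DEMO_OK)
#else
/* RNG with a return value: failures are reported as DEMO_ERR. */
int demo_randombytes(uint8_t *buf, size_t n)
{
    size_t i;
    if (demo_rng_fail) return DEMO_ERR;
    for (i = 0; i < n; i++) buf[i] = (uint8_t)i;   /* NOT random */
    return DEMO_OK;
}
#define DEMO_RNG(buf, n) demo_randombytes((buf), (n))
#endif

/* The caller checks one status code regardless of which variant was built. */
int demo_make_salt(uint8_t *salt, size_t n)
{
    assert(salt != NULL);
    if (DEMO_RNG(salt, n) != DEMO_OK) {
        return DEMO_ERR;   /* propagate instead of exiting */
    }
    return DEMO_OK;
}
```

Built without the flag, a failing RNG surfaces as `DEMO_ERR` from `demo_make_salt`; built with it, the call always reports success, which matches the behaviour of an RNG that cannot signal errors.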

Member Author

I’ve updated the PR to incorporate changes from upstream (pq-code-package/mlkem-native#686).

@SWilson4 Thank you for chiming in! How would a failure be raised, though? One would need to bubble it up the call chain to the top-level API, and change all function signatures along the way. For uniformity, one would also need to change all other places where hashing is used. Is that what you had in mind? I'm hesitant about such change on first thought, but will chat with Matthias. This is not specific to and not a requirement for this PR though, right? Is there already an issue / feature request?

Nope, not a requirement for this PR, but IMO it would be a major improvement for liboqs. A couple of related issues in the backlog are #1456 and #1750.

bhess commented Jan 22, 2025

Thanks for the review @baentsch. The patch is now much smaller; it basically only adapts a few things so that our fips202/sha3 implementation can be used. For the upstream implementation, moving away from relative import paths does not seem straightforward. However, this is no longer an issue because I’ve added an option to copy_from_upstream that preserves the upstream folder structure. As a result, no further patching is required.

@baentsch baentsch self-requested a review January 23, 2025 08:12
@baentsch baentsch dismissed their stale review January 23, 2025 08:13

Comments addressed. Discussion ongoing. Don't want to hinder other approvals moving things forward.

…ended tests]

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>

@SWilson4 SWilson4 left a comment

I haven't attempted to review the code imported from PQCP (and I wouldn't have the expertise to do so anyhow), but the integration-related code looks good to me.

Maybe it's time to rename this file to "upstream_shims" or something similar to reflect the fact that it's no longer exclusive to PQClean?

@baentsch

Quick additional question: Could you share a performance comparison run on the same machine "before-after", @bhess? Should help avoid things like #2047. Thanks in advance!

bhess commented Jan 24, 2025

Quick additional question: Could you share a performance comparison run on the same machine "before-after", @bhess? Should help avoid things like #2047. Thanks in advance!

The following measurements are on an Intel Xeon Gold 6338 CPU @ 2.00GHz, Turbo Boost turned off for consistent results:

  1. Generic implementation
Configuration info
==================
Target platform:  x86_64-Linux-6.1.19
Compiler:         gcc (11.4.0)
Compile options:  [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version:      0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
OpenSSL enabled:  Yes (OpenSSL 3.4.0 22 Oct 2024)
AES:              OpenSSL
SHA-2:            OpenSSL
SHA-3:            C
OQS build flags:  OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release 
CPU exts compile-time:  SSE SSE2

1.1. Old implementation from main

Speed test
==========
Started at 2025-01-24 09:09:54
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-512                           |            |                |                 |            |                           |           
keygen                               |      54132 |          3.000 |          55.420 |     10.625 |                    110759 |      21232
encaps                               |      47809 |          3.000 |          62.751 |      0.584 |                    125417 |        800
decaps                               |      37771 |          3.000 |          79.428 |      0.748 |                    158772 |       1211
ML-KEM-768                           |            |                |                 |            |                           |           
keygen                               |      33523 |          3.000 |          89.491 |     12.123 |                    178901 |      24227
encaps                               |      30855 |          3.000 |          97.232 |      0.715 |                    194379 |       1202
decaps                               |      24982 |          3.000 |         120.088 |      0.801 |                    240104 |       1376
ML-KEM-1024                          |            |                |                 |            |                           |           
keygen                               |      21620 |          3.000 |         138.764 |     14.656 |                    277453 |      29299
encaps                               |      20920 |          3.000 |         143.405 |      0.837 |                    286731 |       1475
decaps                               |      17398 |          3.000 |         172.439 |      0.858 |                    344799 |       1505

1.2 mlkem-native implementation

Started at 2025-01-24 09:11:46
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-512                           |            |                |                 |            |                           |           
keygen                               |      70652 |          3.000 |          42.462 |      8.883 |                     84854 |      17746
encaps                               |      65996 |          3.000 |          45.458 |      0.582 |                     90836 |        678
decaps                               |      54439 |          3.000 |          55.108 |      0.509 |                    110144 |        808
ML-KEM-768                           |            |                |                 |            |                           |           
keygen                               |      43475 |          3.000 |          69.006 |     10.748 |                    137944 |      21482
encaps                               |      42360 |          3.000 |          70.823 |      0.783 |                    141566 |       1295
decaps                               |      35461 |          3.000 |          84.602 |      0.791 |                    169128 |       1312
ML-KEM-1024                          |            |                |                 |            |                           |           
keygen                               |      28875 |          3.000 |         103.900 |     12.978 |                    207728 |      25950
encaps                               |      28994 |          3.000 |         103.471 |      0.817 |                    206867 |       1409
decaps                               |      24693 |          3.000 |         121.496 |      0.869 |                    242923 |       1537

-> We see a nice speedup in the generic code

  2. Optimized implementation (Intel)
Configuration info
==================
Target platform:  x86_64-Linux-6.1.19
Compiler:         gcc (11.4.0)
Compile options:  [-march=native;-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version:      0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
OpenSSL enabled:  Yes (OpenSSL 3.4.0 22 Oct 2024)
AES:              NI
SHA-2:            OpenSSL
SHA-3:            C
OQS build flags:  OQS_LIBJADE_BUILD OQS_OPT_TARGET=auto CMAKE_BUILD_TYPE=Release 
CPU exts compile-time:  ADX AES AVX AVX2 AVX512 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3

2.1 Old implementation from main:

Started at 2025-01-24 09:08:06
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-512                           |            |                |                 |            |                           |           
keygen                               |     249319 |          3.000 |          12.033 |      4.876 |                     23992 |       9740
encaps                               |     237728 |          3.000 |          12.619 |      0.514 |                     25158 |        340
decaps                               |     262800 |          3.000 |          11.416 |      0.515 |                     22763 |        317
ML-KEM-768                           |            |                |                 |            |                           |           
keygen                               |     154496 |          3.000 |          19.418 |      5.696 |                     38764 |      11357
encaps                               |     156005 |          3.000 |          19.230 |      0.462 |                     38380 |        430
decaps                               |     163952 |          3.000 |          18.298 |      0.504 |                     36528 |        457
ML-KEM-1024                          |            |                |                 |            |                           |           
keygen                               |     116438 |          3.000 |          25.765 |      6.358 |                     51462 |      12682
encaps                               |     116221 |          3.000 |          25.813 |      0.479 |                     51538 |        526
decaps                               |     119327 |          3.000 |          25.141 |      0.530 |                     50212 |        853

2.2 mlkem-native implementation

Started at 2025-01-24 09:13:17
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-512                           |            |                |                 |            |                           |           
keygen                               |     247062 |          3.000 |          12.143 |      4.799 |                     24213 |       9580
encaps                               |     181257 |          3.000 |          16.551 |      3.701 |                     33013 |       7367
decaps                               |     154251 |          3.000 |          19.449 |      0.740 |                     38825 |       1103
ML-KEM-768                           |            |                |                 |            |                           |           
keygen                               |     150155 |          3.000 |          19.979 |      5.763 |                     39889 |      11505
encaps                               |     141092 |          3.000 |          21.263 |      0.497 |                     42443 |        503
decaps                               |     112058 |          3.000 |          26.772 |      0.499 |                     53474 |        515
ML-KEM-1024                          |            |                |                 |            |                           |           
keygen                               |     113681 |          3.000 |          26.390 |      6.629 |                     52710 |      13232
encaps                               |     103268 |          3.000 |          29.051 |      0.358 |                     58029 |        532
decaps                               |      81610 |          3.000 |          36.760 |      0.545 |                     73446 |        652

-> The key generation performance is very similar, but there's some performance degradation in encapsulation/decapsulation. This can likely be attributed to the additional key checks implemented in mlkem-native to meet FIPS 203 requirements, which are more noticeable in the otherwise optimized code. Feedback from @mkannwischer would be appreciated to confirm whether this matches your expectations.
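
To put numbers on the degradation, the relative change in mean CPU cycles can be computed directly from the two optimized tables (2.1 and 2.2). The helper below is just a sketch for that arithmetic; the struct and function names are made up, and the cycle counts are copied from the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *op;
    double old_cycles;   /* 2.1, implementation from main */
    double new_cycles;   /* 2.2, mlkem-native */
} bench_row;

/* Mean CPU cycles copied from the optimized (Intel) tables. */
const bench_row opt_rows[] = {
    {"ML-KEM-512 keygen",  23992, 24213},
    {"ML-KEM-512 encaps",  25158, 33013},
    {"ML-KEM-512 decaps",  22763, 38825},
    {"ML-KEM-768 keygen",  38764, 39889},
    {"ML-KEM-768 encaps",  38380, 42443},
    {"ML-KEM-768 decaps",  36528, 53474},
    {"ML-KEM-1024 keygen", 51462, 52710},
    {"ML-KEM-1024 encaps", 51538, 58029},
    {"ML-KEM-1024 decaps", 50212, 73446},
};

/* Relative change in percent; positive means the new code is slower. */
double pct_change(double old_cycles, double new_cycles)
{
    assert(old_cycles > 0.0);
    return (new_cycles - old_cycles) / old_cycles * 100.0;
}

/* Print one line per operation with its relative change. */
void print_changes(void)
{
    size_t i;
    for (i = 0; i < sizeof opt_rows / sizeof opt_rows[0]; i++) {
        printf("%-20s %+6.1f%%\n", opt_rows[i].op,
               pct_change(opt_rows[i].old_cycles, opt_rows[i].new_cycles));
    }
}
```

For example, ML-KEM-768 decaps goes from 36528 to 53474 mean cycles, an increase of about 46%, while ML-KEM-512 keygen changes by less than 1%.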

@mkannwischer

-> The key generation performance is very similar, but there's some performance degradation in encapsulation/decapsulation. This can likely be attributed to the additional key checks implemented in mlkem-native to meet FIPS203 requirements, which are more noticeable in the otherwise optimized code. Feedback from @mkannwischer would be appreciated to confirm if this aligns with your expectations.

Thanks for the benchmarks.
No, this is weird. The performance impact of input validation is expected to be around 1% for encaps and maybe 20% for decaps. That doesn't match what you are seeing, so something else must be going on in addition.
I was able to reproduce some of the weirdness you are seeing on a Cascade Lake machine just now. I will get back to you once I have found out what's going on there.

Development

Successfully merging this pull request may close these issues.

ML-KEM doesn't perform encapsulation key check
5 participants