
NVIDIA: Adding cuPQC as a backend for ML-KEM. #2044

Open · wants to merge 17 commits into main

Conversation

stevenireeves

This PR adds support for the NVIDIA library cuPQC to be used as the backend for ML-KEM algorithms.

cuPQC requires the use of an NVIDIA GPU to perform the PQC algorithms.
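For context, the backend is enabled through a CMake option (a minimal sketch, assuming an out-of-tree Ninja build and an installed CUDA toolchain; OQS_USE_CUPQC is the option added by this PR):

    mkdir build && cd build
    cmake -GNinja -DOQS_USE_CUPQC=ON ..
    ninja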

  • Does this PR change the input/output behaviour of a cryptographic algorithm (i.e., does it change known answer test values)? (If so, a version bump will be required from x.y.z to x.(y+1).0.)
  • Does this PR change the list of algorithms available -- either adding, removing, or renaming? Does this PR otherwise change an API? (If so, PRs in fully supported downstream projects dependent on these, i.e., oqs-provider will also need to be ready for review and merge by the time this is merged.)

Signed-off-by: Steven Reeves <sreeves@nvidia.com>
@stevenireeves
Author

@praveksharma @neil-lindquist for visibility.

@stevenireeves
Author

@praveksharma looks like a number of GitHub Actions are failing; can you help with this, or point to what needs to be changed?

@baentsch
Member

@stevenireeves For the code formatting errors, please review this. For the basic test error (an error, actually), please open the "twisty" in the CI error report, e.g. here at the "Configure" step, to see the config command executed. You can run it locally and see the problem: the new config variable apparently isn't initialized in all cases:
[screenshot: CI "Configure" step error output]

Signed-off-by: Steven Reeves <sreeves@nvidia.com>
@praveksharma
Member

praveksharma commented Jan 15, 2025

@stevenireeves I've pushed 3 commits to https://github.com/open-quantum-safe/liboqs/tree/libOQSxcuPQC, which builds off of your branch, to get it to pass CI. I can't seem to push the changes directly to your fork.

Here's how to reproduce the changes:

  1. Commit 1:
     export LIBOQS_DIR=`pwd`
     cd ./scripts/copy_from_upstream
     python3 ./copy_from_upstream.py copy
  2. Commit 2:
     ./tests/run_astyle.sh --no-dry-run
  3. Commit 3: replace spaces with tabs in scripts/copy_from_upstream/src/kem/family/kem_scheme.c as in the commit.
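For reference, the first two commits as a single shell session (a sketch; the cd back to the repo root is added here for continuity and assumes you start from a liboqs checkout):

    export LIBOQS_DIR=`pwd`
    cd ./scripts/copy_from_upstream
    python3 ./copy_from_upstream.py copy
    cd "$LIBOQS_DIR"
    ./tests/run_astyle.sh --no-dry-run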

…_####.c and kem/family/kem_scheme.c

Signed-off-by: Steven Reeves <sreeves@nvidia.com>
@stevenireeves stevenireeves marked this pull request as ready for review January 15, 2025 20:37
@praveksharma
Member

Thank you for making these changes @stevenireeves! Since I reviewed the private PR initially I'll let this one be reviewed by reviewers other than myself.

@dstebila
Member

I talked to @praveksharma just now to understand some of the approach here, and I understand why the cupqc metadata is patched into the PQ Crystals meta.yml file (since there isn't another meta.yml to patch it into); so that makes sense. But I think a few other pieces could be done differently. I don't think the *.cu files need to be added via the patch file; couldn't they just be added directly without also being added through the patch? And if so then could we give those directories a name that doesn't include pqcrystals -- e.g., just src/kem/ml_kem/cupqc-ml-kem-768?

@stevenireeves
Author

I talked to @praveksharma just now to understand some of the approach here, and I understand why the cupqc metadata is patched into the PQ Crystals meta.yml file (since there isn't another meta.yml to patch it into); so that makes sense. But I think a few other pieces could be done differently. I don't think the *.cu files need to be added via the patch file; couldn't they just be added directly without also being added through the patch? And if so then could we give those directories a name that doesn't include pqcrystals -- e.g., just src/kem/ml_kem/cupqc-ml-kem-768?

@dstebila with this organizational change will we need to mess with the copy_from_upstream stuff as well?

@praveksharma
Member

@dstebila with this organizational change will we need to mess with the copy_from_upstream stuff as well?

@stevenireeves I shall work on this and attempt to push the changes directly to your fork. I don't believe copy_from_upstream.py would need to be changed significantly.

@stevenireeves
Author

@dstebila with this organizational change will we need to mess with the copy_from_upstream stuff as well?

@stevenireeves I shall work on this and attempt to push the changes directly to your fork. I don't believe copy_from_upstream.py would need to be changed significantly.

I have the "allow edits from maintainers" option selected, so you should be able to make those changes.

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@praveksharma
Member

I don't think the *.cu files need to be added via the patch file; couldn't they just be added directly without also being added through the patch? And if so then could we give those directories a name that doesn't include pqcrystals -- e.g., just src/kem/ml_kem/cupqc-ml-kem-768?

@dstebila handling the naming shouldn't be an issue. The *.cu file must be re-sourced with every run of copy_from_upstream.py under delete mode; since there is no upstream, the *.cu must be sourced from within the liboqs repo - a patch is the most straightforward way of doing this. Would a separate patch file for the *.cu file be adequate?

@baentsch
Member

I understand why the cupqc metadata is patched into the PQ Crystals meta.yml file (since there isn't another meta.yml to patch it into); so that makes sense

Hmm -- it doesn't to me: We have #2041 in the pipeline with the declared goal of removing the PQCrystals files. This PR would thus need to be redone after that lands, right? That doesn't seem sensible.

Wouldn't it be much more sensible to have this code contained in an upstream of its own to pull it from? As far as I can see, this is effectively a different implementation with different license terms, characteristics, etc. Such a split would also make responsibilities clear: NVIDIA supports the interfacing to its library (which may very well change over time unbeknownst to OQS) and OQS is responsible for the proper operation within/integration into the OQS APIs (which may change unbeknownst to NVIDIA). Or is the intention by NVIDIA to become committed, well, committers and/or maintainers of OQS @stevenireeves ?

@stevenireeves
Author

Hmm -- it doesn't to me: We have #2041 in the pipeline with the declared goal of removing the PQCrystals files. This PR would thus need to be redone after that lands, right? That doesn't seem sensible.

Wouldn't it be much more sensible to have this code contained in an upstream of its own to pull it from? As far as I can see, this is effectively a different implementation with different license terms, characteristics, etc. Such a split would also make responsibilities clear: NVIDIA supports the interfacing to its library (which may very well change over time unbeknownst to OQS) and OQS is responsible for the proper operation within/integration into the OQS APIs (which may change unbeknownst to NVIDIA). Or is the intention by NVIDIA to become committed, well, committers and/or maintainers of OQS @stevenireeves ?

Nvidia maintaining its own fork of liboqs is not what we want to do. The intention of NVIDIA is to support the portions of liboqs that utilize cuPQC as their backend. So if there are significant changes that require alteration of the cupqc-related source files in liboqs, we will help there (so long as we have the internal support). I believe that @praveksharma is in the process of removing the cupqc metadata from the PQ Crystals meta.yaml; correct me if I'm wrong.

@dstebila
Member

dstebila commented Jan 16, 2025

I don't think the *.cu files need to be added via the patch file; couldn't they just be added directly without also being added through the patch? And if so then could we give those directories a name that doesn't include pqcrystals -- e.g., just src/kem/ml_kem/cupqc-ml-kem-768?

@dstebila handling the naming shouldn't be an issue. The *.cu file must be re-sourced with every run of copy_from_upstream.py under delete mode; since there is no upstream, the *.cu must be sourced from within the liboqs repo - a patch is the most straightforward way of doing this. Would a separate patch file for the *.cu file be adequate?

Oh, I see now. I knew that we have some local implementations that --delete doesn't wipe out (e.g., FrodoKEM), but these algorithms aren't handled at all by copy_from_upstream. But I understand now that for algorithms that are handled by copy_from_upstream, there's no mixing of local (not deleted) and upstream (deleted) implementations.

So I see why the patch file is adding these.

@dstebila
Member

Wouldn't it be much more sensible to have this code contained in an upstream of its own to pull it from? As far as I can see, this is effectively a different implementation with different license terms, characteristics, etc. Such a split would also make responsibilities clear: NVIDIA supports the interfacing to its library (which may very well change over time unbeknownst to OQS) and OQS is responsible for the proper operation within/integration into the OQS APIs (which may change unbeknownst to NVIDIA). Or is the intention by NVIDIA to become committed, well, committers and/or maintainers of OQS @stevenireeves ?

Nvidia maintaining its own fork of liboqs is not what we want to do. The intention of NVIDIA is to support the portions of liboqs that utilize cuPQC as their backend. So if there are significant changes that require alteration of the cupqc-related source files in liboqs, we will help there (so long as we have the internal support). I believe that @praveksharma is in the process of removing the cupqc metadata from the PQ Crystals meta.yaml; correct me if I'm wrong.

I think these are talking about two slightly different things. Michael isn't suggesting that NVIDIA maintain a separate fork of liboqs. The way our code has been structured is that we have scripts to pull in source code from (self-contained) implementations of algorithms in other repositories, and add them to liboqs using code generation and patches.

This could be done that way, but since it is also only one file that is being added, I don't think it's worth the effort of setting up a separate upstream for that.

@baentsch
Member

Nvidia maintaining its own fork of liboqs is not what we want to do.

This is not my ask: I only (meant to :) ask whether you'd want to maintain the wrapper code around the cuPQC code in a separate project -- complete with a META.yml for copy_from_upstream to pull it in -- thus obviating the need to patch other implementations' META.yml files.

@SWilson4
Member

I added the --delete functionality for copy_from_upstream quite recently. I can add an exception for the CUDA files so that they don't get cleaned out; it would be quite straightforward and IMO would make maintenance of those files a lot easier. Editing patch files instead of source files is a pain. How does this sound @praveksharma?

@praveksharma
Member

I added the --delete functionality for copy_from_upstream quite recently. I can add an exception for the CUDA files so that they don't get cleaned out; it would be quite straightforward and IMO would make maintenance of those files a lot easier. Editing patch files instead of source files is a pain. How does this sound @praveksharma?

Thank you for offering @SWilson4, but I think it is more prudent to create a separate git repository to store the cupqc_ml-kem metadata since the pqcrystals upstream is going to be deprecated soon in any case. If things work out okay, @stevenireeves or someone else from Nvidia can take ownership of the repo.

@stevenireeves
Author

stevenireeves commented Jan 16, 2025

Thank you for offering @SWilson4, but I think it is more prudent to create a separate git repository to store the cupqc_ml-kem metadata since the pqcrystals upstream is going to be deprecated soon in any case. If things work out okay, @stevenireeves or someone else from Nvidia can take ownership of the repo.

@praveksharma I am not sure what metadata you want us to store. Do you want us to write the wrapper code in an additional repository?

…metadata to separate upstream repo

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@praveksharma
Member

I think it is more prudent to create a separate git repository to store cupqc_ml-kem metadata since the pqcrystals upstream is going to be deprecated soon in any case

I've updated the import mechanism to use https://github.com/praveksharma/cupqc-mlkem as an upstream. Ideally I would like to modify copy_from_upstream to not require ad hoc upstreams for situations such as this.

Could you please review the updated PR @SWilson4 @dstebila @baentsch ?

@stevenireeves
Author

I've updated the import mechanism to use https://github.com/praveksharma/cupqc-mlkem as an upstream. Ideally I would like to modify copy_from_upstream to not require ad hoc upstreams for situations such as this.

Thanks @praveksharma

Member

@baentsch left a comment


Basically LGTM, @stevenireeves @praveksharma ; just some nits (see single comments). Also the question whether it would be worth while adding build documentation and a "tier 3" entry to PLATFORMS.md (as I guess we'll never be able to test this in CI, right?)

CONFIGURE.md Outdated
@@ -124,6 +125,13 @@ Dynamically load OpenSSL through `dlopen`. When using liboqs from other cryptogr

Only has an effect if the system supports `dlopen` and ELF binary format, such as Linux or BSD family.

### OQS_USE_CUPQC

Can be `ON` or `OFF`. When `ON`, use NVIDIA's cuPQC library where able (currently just ML-KEM). When this option is enabled, liboqs may not run correctly on machines that lack supported GPUs. To download cuPQC follow the instructions at (https://developer.nvidia.com/cupqc-download/). Detailed descriptions of the API, requirments, and installation guide are in the cuPQC documentation (https://docs.nvidia.com/cuda/cupqc/index.html).
Member


nit: typo "requirements"

@@ -9,6 +9,10 @@
- **Primary Source**<a name="primary-source"></a>:
- **Source**: https://github.com/pq-crystals/kyber/commit/10b478fc3cc4ff6215eb0b6a11bd758bf0929cbd with copy_from_upstream patches
- **Implementation license (SPDX-Identifier)**: CC0-1.0 or Apache-2.0
- **Optimized Implementation sources**: https://github.com/pq-crystals/kyber/commit/10b478fc3cc4ff6215eb0b6a11bd758bf0929cbd with copy_from_upstream patches
- **cupqc-cuda**:<a name="cupqc-cuda"></a>
- **Source**: https://github.com/praveksharma/cupqc-mlkem/commit/adb8454e56979628c07b67eb7d90f9337be6dc30
Member


Is this going to move to a non-personal GH repo?

Member


That is the goal @baentsch. Alternatively, I would like to modify copy_from_upstream to store changes locally in the case of a minimal "upstream" such as this one.

@@ -38,6 +38,14 @@ upstreams:
kem_meta_path: '{pretty_name_full}_META.yml'
kem_scheme_path: '.'
patches: [pqcrystals-ml_kem.patch]
-
name: cupqc
git_url: https://github.com/praveksharma/cupqc-mlkem.git
Member


Please eventually move to a long-term maintained GH repo

Member

@SWilson4 left a comment


I know we'd need to get additional hardware to run CI tests on the CUDA code, but is it possible to add a build-only job?

docs/algorithms/kem/ml_kem.yml (outdated; review comment resolved)
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@praveksharma
Member

I know we'd need to get additional hardware to run CI tests on the CUDA code, but is it possible to add a build-only job?

I'm trying to do this on a machine without an Nvidia GPU and am running into trouble. Do you know if this is possible @stevenireeves?

Also the question whether it would be worthwhile adding build documentation and a "tier 3" entry to PLATFORMS.md (as I guess we'll never be able to test this in CI, right?)

I think we should do this. But I'm not familiar enough with GPU architectures to know which specific platform to list. @stevenireeves, given cuPQC's requirements:

    CUDA Toolkit 12.4 or newer
    Supported CUDA compiler
    x86_64 CPU
    An NVIDIA GPU with one of the following architectures: 70, 75, 80, 86, 89, 90

is listing "NVIDIA GPU architectures 70, 75, 80, 86, 89, and 90 with an x86_64 CPU" sufficient information for those in the know? For reference, here are other Tier 3 platforms from PLATFORMS.md:

    x86 for Windows (Visual Studio Toolchain)
    ppc64le for Ubuntu (Focal)
    s390x for Ubuntu (Focal)
    loongarch64 for Debian Linux (trixie)

@stevenireeves
Author

stevenireeves commented Jan 17, 2025

@praveksharma

I'm trying to do this on a machine without an Nvidia GPU and am running into trouble. Do you know if this is possible @stevenireeves?

I think if you have the NVIDIA toolkit 12.4 (even though the machine does not have a GPU) you should be able to build.

I think we should do this. But I'm not familiar enough with GPU architectures to know which specific platform to list. @stevenireeves, given cuPQC's requirements:

    CUDA Toolkit 12.4 or newer
    Supported CUDA compiler
    x86_64 CPU
    An NVIDIA GPU with one of the following architectures: 70, 75, 80, 86, 89, 90

is listing "NVIDIA GPU architectures 70, 75, 80, 86, 89, and 90 with an x86_64 CPU" sufficient information for those in the know? For reference, here are other Tier 3 platforms from PLATFORMS.md:

cuPQC only supports Linux with x86_64 currently. We only officially support archs 70-90, but I don't see why any GPU made after 70 wouldn't work. I think we could say "Linux NVIDIA GPU architectures 70, 75, 80, 86, 89, and 90 with an x86_64 CPU" in that list and be good.

In the future we can update this if we support more OSes/CPUs and GPUs.

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@praveksharma
Member

I know we'd need to get additional hardware to run CI tests on the CUDA code, but is it possible to add a build-only job?

I think if you have the NVIDIA toolkit 12.4 (even though the machine does not have a GPU) you should be able to build.

This would require the CI images to be updated with the NVIDIA toolkit, so I'll get started on that.

@baentsch
Member

This would require the CI images to be updated with the NVIDIA toolkit, so I'll get started on that.

If there's no concrete desire to list this feature in a higher support tier (who wants that?), that'd be "future work" in my eyes, @praveksharma (or at least lower priority than many other open issues), and for me not a prerequisite for an approval of this PR.

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@stevenireeves
Author

@praveksharma looks like the build with OQS_USE_CUPQC is still failing in this PR, although I can't see what is causing the failures. Did Neil's comments in the email help?

@praveksharma
Member

praveksharma commented Jan 22, 2025

@praveksharma looks like the build with OQS_USE_CUPQC is still failing in this PR, although I can't see what is causing the failures. Did Neil's comments in the email help?

@stevenireeves Yes, the comments were super helpful. I can successfully build locally; this is most likely an issue with the CI config itself that I'm still troubleshooting.

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@praveksharma
Member

@stevenireeves cmake in CI is failing with this message:

CMake Error at /usr/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:270 (message):
Failed to detect a default CUDA architecture.

I am unable to reproduce this error locally in a docker container provisioned from the same image used in CI. Do you know how to fix this?

@stevenireeves
Author

stevenireeves commented Jan 23, 2025

CMake Error at /usr/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:270 (message):
Failed to detect a default CUDA architecture.

I am unable to reproduce this error locally in a docker container provisioned from the same image used in CI. Do you know how to fix this?

@praveksharma ah, because CI has no GPU, it can't detect a default. Try adding this flag to the CMake command:
-DCMAKE_CUDA_ARCHITECTURES=80

Specifically for CI.
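That is, the CI configure step would look something like this (a sketch combining the flags discussed in this thread, not the exact CI command):

    # No GPU is present in CI, so pin a target architecture instead of relying on auto-detection.
    cmake -GNinja -DOQS_USE_CUPQC=ON -DCMAKE_CUDA_ARCHITECTURES=80 ..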

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@stevenireeves
Author

@praveksharma looks like that CI test is still failing.

No CMAKE_CUDA_COMPILER could be found.

Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.

You'll need to tell CMake where nvcc is.
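For example (a sketch; the nvcc path assumes a default CUDA Toolkit 12.4 install under /usr/local and should be adjusted to the actual location):

    # Either point CMake at nvcc via the environment variable...
    export CUDACXX=/usr/local/cuda-12.4/bin/nvcc
    # ...or via the cache entry named in the error message.
    cmake -GNinja -DOQS_USE_CUPQC=ON -DCMAKE_CUDA_ARCHITECTURES=80 \
          -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc ..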

Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
Signed-off-by: Pravek Sharma <sharmapravek@gmail.com>
@stevenireeves
Author

@praveksharma looks like that did the trick. Any other comments by the reviewers?

Member

@SWilson4 left a comment


LGTM, with the caveat that we should eventually move away from a personal repo, as noted by @baentsch above.

@baentsch
Member

LGTM, with the caveat that we should eventually move away from a personal repo, as noted by @baentsch above.

Agreed: Please do this now or create an issue, ideally assigned, so this is not forgotten.

@praveksharma
Member

Agreed: Please do this now or create an issue, ideally assigned, so this is not forgotten.

@baentsch #2053 has been created to track this.

@stevenireeves do you want to move the upstream to your GitHub account?

@stevenireeves
Author

stevenireeves commented Jan 24, 2025

@baentsch #2053 has been created to track this.

@stevenireeves do you want to move the upstream to your GitHub account?

@praveksharma
I think it's fine where it is now, especially if we are just going to refactor copy_from_upstream so it's irrelevant. If we can, it would be great to get this merged today.
