[RFC] Example use of NAMD CudaGlobalMaster interface #783

Draft · wants to merge 22 commits into master
Conversation

@HanatoK (Member) commented Mar 20, 2025

Hi @giacomofiorin and @jhenin! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations, the implementation still has to copy the data from the GPU to the CPU, but I think it is a first step toward making Colvars GPU-resident (or at least a test bed for the GPU porting).

Compilation

To test the new interface, you need to run the following commands to build the interface with Colvars as a shared library:

```
cd namd_cudaglobalmaster/
mkdir build
cd build
cmake -DNAMD_DIR=<YOUR_NAMD_SOURCE_CODE_DIRECTORY> ../
make -j2
```

Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is broken, so you need to switch to the fix_cudagm_reduction branch to build both the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).

Example usage

The example NAMD input file can be found in namd_cudaglobalmaster/example/alad.namd, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.

Limitations

  • Colvars does not provide any interface to notify the MD engine that it has finished changing the atom selection, so I have to reallocate all buffers whenever init_atom or clear_atom is called;
  • Most of the interface code is copied from the existing colvarproxy_namd.*, but I am still not sure why some functions such as update_target_temperature(), update_engine_parameters(), setup_input() and setup_output() seem to be called multiple times there;
  • CudaGlobalMaster copies the atoms to the buffers in xxxyyyzzz format, as discussed in GPU preparation work #652. However, Colvars still uses xyzxyzxyz, so I have to transform the arrays in the interface code (see the sketch after this list);
  • SMP is disabled, as it conflicts with the goal of GPU residency;
  • volmap is not available;
  • More tests are needed;
  • The build file namd_cudaglobalmaster/CMakeLists.txt should add Colvars via add_subdirectory instead of collecting all source files directly.
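For illustration, the layout transform mentioned in the third bullet amounts to something like the following minimal CPU-side sketch (the actual interface does this with CUDA transpose kernels; the function and variable names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Convert an xxx...yyy...zzz (structure-of-arrays) buffer, as provided by
// CudaGlobalMaster, into the xyzxyzxyz (array-of-structures) layout that
// Colvars currently expects.
void soa_to_aos(const std::vector<double> &soa, std::vector<double> &aos,
                std::size_t num_atoms) {
  aos.resize(3 * num_atoms);
  for (std::size_t i = 0; i < num_atoms; ++i) {
    aos[3 * i + 0] = soa[i];                 // x block
    aos[3 * i + 1] = soa[num_atoms + i];     // y block
    aos[3 * i + 2] = soa[2 * num_atoms + i]; // z block
  }
}
```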

@HanatoK (Member, Author) commented Mar 21, 2025

I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (each involving 1,189 atoms) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used, and the GPU was a laptop RTX 3060.

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 62.9669 |
| Colvars with this interface | 62.4366 |
| Colvars with GlobalMaster | 57.365 |

@HanatoK (Member, Author) commented Mar 22, 2025

Another issue:

It is a bit strange that this CudaGlobalMaster plugin loads the colvarproxy-related symbols from NAMD instead of from the Colvars source code compiled into it.

HanatoK added 3 commits March 24, 2025 13:16
This commit uses a custom allocator for the containers of positions, applied forces, total forces, masses and charges. The custom allocator ensures that the vectors are allocated in host-pinned memory, so that the CUDA transpose kernels can transpose and copy the data from the GPU directly to the host, which reduces data movement.
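As a rough sketch, a host-pinned allocator of this kind could look as follows, assuming the CUDA runtime API (the allocator actually used in this branch may differ in detail):

```cpp
#include <cstddef>
#include <new>
#include <vector>
#include <cuda_runtime.h>

// Minimal allocator that places std::vector storage in page-locked (pinned)
// host memory, so GPU kernels can write to it via asynchronous copies without
// an intermediate staging buffer.
template <typename T>
struct PinnedHostAllocator {
  using value_type = T;
  PinnedHostAllocator() = default;
  template <typename U> PinnedHostAllocator(const PinnedHostAllocator<U> &) {}
  T *allocate(std::size_t n) {
    void *ptr = nullptr;
    if (cudaMallocHost(&ptr, n * sizeof(T)) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T *>(ptr);
  }
  void deallocate(T *p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T> &, const PinnedHostAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T> &, const PinnedHostAllocator<U> &) { return false; }

// Example: position buffers stored in pinned host memory.
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```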
@HanatoK (Member, Author) commented Mar 24, 2025

With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the dlopen fix in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 32.6442 |
| Colvars with this interface | 27.1574 |
| Colvars with GlobalMaster | 14.2476 |

@HanatoK (Member, Author) commented Mar 25, 2025

> With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the dlopen fix in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.
>
> | Setup | Simulation speed (ns/day) |
> | --- | --- |
> | No Colvars | 32.6442 |
> | Colvars with this interface | 27.1574 |
> | Colvars with GlobalMaster | 14.2476 |

The benchmark above did not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (along with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:

AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 27.8499 |
| Colvars with CudaGlobalMaster | 26.871 |
| Colvars with GlobalMaster | 13.9529 |

Intel(R) Xeon(R) Platinum 8168 CPU + 2xTesla V100-SXM3-32GB

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 54.1266 |
| Colvars with CudaGlobalMaster | 38.1852 |
| Colvars with GlobalMaster | 7.22716 |

ARM Neoverse-V2 + NVIDIA GH200

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 78.6648 |
| Colvars with CudaGlobalMaster | 67.9719 |
| Colvars with GlobalMaster | 22.5077 |

@jhenin (Member) commented Mar 26, 2025

Impressive results! Do you see a path to making this work transparently in specialized, static builds of Colvars/NAMD, or are there lasting reasons that this needs to be a dynamic library?

@HanatoK (Member, Author) commented Mar 26, 2025

> Impressive results! Do you see a path to making this work transparently in specialized, static builds of Colvars/NAMD, or are there lasting reasons that this needs to be a dynamic library?

No. This implementation uses a new interface (which I call CudaGlobalMaster) that is independent of the old GlobalMaster interface. The GlobalMaster interface was designed more than twenty years ago, when there was no GPU computing. To stay compatible with GlobalMaster, the GPU-resident code path has to copy all atoms from GPU memory to the SOA buffers of the patches; the ComputeGlobal objects then convert them to AOS buffers and send the aggregated atoms to GlobalMasterServer via a Charm++ message, and finally GlobalMasterServer copies the requested atoms to Colvars (as a client derived from GlobalMaster). As you can see, the data copying is massive and very indirect. I am not sure how to improve it without breaking many things like the CPU MPI build.

CudaGlobalMaster is specialized for the GPU-resident mode: it only copies the atoms that are requested by clients, and copies them only once (for Colvars it has to be twice, because Colvars requires an extra GPU-to-CPU copy). Dynamic loading is more flexible because the plugins can link to any third-party libraries they want. If I used dynamic/static linking for the interface, then clients that use PyTorch or TensorFlow would force NAMD to link against them as well, which would make NAMD nearly impossible to distribute.

@giacomofiorin (Member) commented:
I totally agree with @jhenin!

Many of us thought that much of the slowdown when running NAMD GPU-resident + Colvars came from the way data is copied (which is inherited from the constraints of GlobalMaster, an almost 30-year-old piece of code). But I am still amazed that you managed to get such a speedup without even touching the (slow and inefficient) code in the Colvars library. Absolutely impressive indeed!

If you think that dynamic linkage is absolutely required, we could work with it: you just demonstrated very clearly the value of supporting CUDAGlobalMaster.

That said, in making your considerations please also factor in that non-static executables take more work to install and maintain, either by the users or by their support staff. Academic institutions have historically had a difficult time recruiting good sysadmins, and (at least in the US) this is getting even harder lately :-(

@HanatoK (Member, Author) commented Mar 26, 2025

> I totally agree with @jhenin!
>
> Many of us thought that much of the slowdown when running NAMD GPU-resident + Colvars came from the way data is copied (which is inherited from the constraints of GlobalMaster, an almost 30-year-old piece of code). But I am still amazed that you managed to get such a speedup without even touching the (slow and inefficient) code in the Colvars library. Absolutely impressive indeed!
>
> If you think that dynamic linkage is absolutely required, we could work with it: you just demonstrated very clearly the value of supporting CUDAGlobalMaster.
>
> That said, in making your considerations please also factor in that non-static executables take more work to install and maintain, either by the users or by their support staff. Academic institutions have historically had a difficult time recruiting good sysadmins, and (at least in the US) this is getting even harder lately :-(

Thanks for your comments! This is a plugin-like library. It contains all the Colvars code symbols itself, and NAMD loads it dynamically. In other words, it is dynamically loaded but not dynamically linked, and it should be targeted at a specific version of NAMD. If NAMD users don't use Colvars, they don't need to load this library. This is similar to PLUMED, where you dynamically load libplumedKernel.so. Since PLUMED has a much larger user base than Colvars and many sysadmins already know how to maintain PLUMED, I think it would not be difficult to maintain a plugin like this.

@jhenin (Member) commented Mar 26, 2025

Thanks for the explanations, @HanatoK. I'm sure sysadmins would manage to build this if they had to, but if there is any way at all we can maintain the out-of-the-box, seamless Colvars experience, I think that would be a huge benefit to our users. Right now they have inputs that run seamlessly with GlobalMaster. If they could use the same input and the official binary from Illinois and get the performance that you just unlocked, that would be just awesome. I'm willing to put my own time and effort into this if there is any chance to make it happen.

@HanatoK (Member, Author) commented Mar 26, 2025

> Thanks for the explanations, @HanatoK. I'm sure sysadmins would manage to build this if they had to, but if there is any way at all we can maintain the out-of-the-box, seamless Colvars experience, I think that would be a huge benefit to our users. Right now they have inputs that run seamlessly with GlobalMaster. If they could use the same input and the official binary from Illinois and get the performance that you just unlocked, that would be just awesome. I'm willing to put my own time and effort into this if there is any chance to make it happen.

Thanks! This interface is still preliminary. I have only partially completed the TCL integration today, and I will need to test it. I still don't know how colvarscript works, and I don't even know whether colvarscript requires TCL or not. On the CudaGlobalMaster interface side, I have implemented virtual std::string updateFromTCLCommand(const std::vector<std::string>& arguments), which accepts arguments between runs and allows the client to do anything it wants. In other words, I expect Colvars to have a general scripting interface like int cvscript_run(int argc, char* argv[]).
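For illustration, such an updateFromTCLCommand() override could forward its arguments roughly like the free function below (a sketch only: cvscript_run() is merely declared here as the assumed entry point, and the result handling is simplified):

```cpp
#include <string>
#include <vector>

// Assumed argc/argv-style scripting entry point, as described above.
int cvscript_run(int argc, char *argv[]);

// Hypothetical bridge: forward the TCL arguments received from NAMD to the
// Colvars scripting interface.
std::string forward_to_colvars_script(const std::vector<std::string> &arguments) {
  std::vector<char *> argv;
  argv.reserve(arguments.size());
  for (const std::string &arg : arguments) {
    argv.push_back(const_cast<char *>(arg.c_str()));
  }
  const int err = cvscript_run(static_cast<int>(argv.size()), argv.data());
  // A real updateFromTCLCommand() would also fetch the result text from the
  // Colvars scripting object; here only success or failure is reported.
  return (err == 0) ? std::string("OK")
                    : "Colvars script error code " + std::to_string(err);
}
```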

@giacomofiorin (Member) commented:
> I'm willing to put my own time and effort into this if there is any chance to make it happen.

Likewise from me :-)

Having Colvars in all official builds reaches a fairly large user base, which has evolved to rely on it (often in not-so-visible ways). Besides the several tutorials available, CHARMM-GUI implicitly relies on Colvars when producing NAMD input decks for most membrane systems. There is a high chance that a typical NAMD user will also become a Colvars user at some point.

> Thanks! This interface is still preliminary. I have only partially completed the TCL integration today, and I will need to test it. I still don't know how colvarscript works, and I don't even know whether colvarscript requires TCL or not. On the CudaGlobalMaster interface side, I have implemented virtual std::string updateFromTCLCommand(const std::vector<std::string>& arguments), which accepts arguments between runs and allows the client to do anything it wants. In other words, I expect Colvars to have a general scripting interface like int cvscript_run(int argc, char* argv[]).

Here it looks like something we talked about earlier could help. Contrary to the early days, it makes less and less sense to derive colvarproxy_namd from GlobalMaster. Even for a plugin implementation, it would be better to consolidate what is common as much as possible and have a more abstract interface that supports both GlobalMaster and CUDAGlobalMaster (based on the availability of the code and the user's input).

@HanatoK (Member, Author) commented Mar 27, 2025

> Here it looks like something we talked about earlier could help. Contrary to the early days, it makes less and less sense to derive colvarproxy_namd from GlobalMaster. Even for a plugin implementation, it would be better to consolidate what is common as much as possible and have a more abstract interface that supports both GlobalMaster and CUDAGlobalMaster (based on the availability of the code and the user's input).

Again, I am afraid that, since CUDAGlobalMaster greatly differs from GlobalMaster, it is not very meaningful to have an abstract interface. CUDAGlobalMaster does not support SMP, and it calls clients on a specific PE (the master PE) that controls the GPU device, which may not be PE 0. The scripting interface (ScriptTcl) of NAMD only supports PE 0. To support scripting of any clients, I have added ScriptTcl::Tcl_gpuGlobalUpdateClient, which basically broadcasts the TCL arguments to all PEs, and only the master PE will receive the arguments. More specifically, in namd_cudaglobalmaster/example/alad.namd of this PR, the line

```
gpuGlobalCreateClient ../build/libcudaglobalmastercolvars.so COLVARS opes.colvars
```

creates a client instance COLVARS, and the NAMD TCL command

```
set result [gpuGlobalUpdateClient COLVARS xxx yyy zzz]
```

will pass xxx yyy zzz as a std::vector<std::string> to the COLVARS instance and expect a std::string result.

I think when we say "Colvars scripting" there are essentially two "directions":

  1. Colvars calls some scripts to compute the CVs and biases. I guess this can be solved by simply using set_tcl_interp;
  2. A scripting language calls Colvars. In that case it seems that calling proxy->script->run is enough, but I am not sure, and this is not how the traditional GlobalMaster interface works.

The only code shared between both interfaces seems to be the updating of masses and charges, the PDB readers, the setting of the simulation temperature, and the I/O streams.

@giacomofiorin (Member) commented:
> Again, I am afraid that, since CUDAGlobalMaster greatly differs from GlobalMaster, it is not very meaningful to have an abstract interface.

That is absolutely true: I most certainly do not want to suggest that those two wildly different classes should have a shared API 😄

@jhenin (Member) commented Mar 27, 2025

Then, if we agree on removing the inheritance, could colvarproxy_namd have a GlobalMaster and a CUDAGlobalMaster member, and switch between them as appropriate?

@HanatoK (Member, Author) commented Mar 27, 2025

> Then, if we agree on removing the inheritance, could colvarproxy_namd have a GlobalMaster and a CUDAGlobalMaster member, and switch between them as appropriate?

I am not sure how removing the inheritance would work. There would still need to be a class derived from GlobalMaster to make Colvars work with the GlobalMaster interface. Could you give me more details about your plan?

@giacomofiorin (Member) commented:
> I am not sure how removing the inheritance would work. There would still need to be a class derived from GlobalMaster to make Colvars work with the GlobalMaster interface. Could you give me more details about your plan?

This branch contains a commit that makes GlobalMasterColvars a real class that implements a thin wrapper around colvarproxy_namd (as opposed to the latter inheriting from the former).

It passes all tests except the ones related to the volmaps; if I can't fix those with a bit more time, it's not a deal breaker for me, because my preference would be to discontinue that code path altogether (in #737 and later work).
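A rough sketch of the composition approach described above (the forwarding call and constructor details are hypothetical; only the idea of GlobalMasterColvars owning a colvarproxy_namd instead of colvarproxy_namd inheriting from GlobalMaster is taken from this thread):

```cpp
#include <memory>

#include "GlobalMaster.h"      // NAMD's GlobalMaster base class
#include "colvarproxy_namd.h"  // the Colvars proxy for NAMD

// Thin wrapper: GlobalMasterColvars owns a colvarproxy_namd instance and
// forwards the GlobalMaster callback to it.
class GlobalMasterColvars : public GlobalMaster {
public:
  GlobalMasterColvars() : proxy_(std::make_unique<colvarproxy_namd>()) {}

protected:
  void calculate() override {
    // Hypothetical forwarding call: hand the requested atoms to the proxy,
    // let Colvars compute the biases, then read back the applied forces.
    proxy_->calculate();
  }

private:
  std::unique_ptr<colvarproxy_namd> proxy_;  // all Colvars logic stays in the proxy
};
```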

commit b589c9720ac8163a025bf05a475035f8bcd72f89
Author: HanatoK <summersnow9403@gmail.com>
Date:   Thu Mar 27 15:59:38 2025 -0500

    feat: forward updateFromTCLCommand to Colvars scripting interface
@HanatoK (Member, Author) commented Mar 27, 2025

@giacomofiorin I have tried to implement the scripting in the CudaGlobalMaster interface in the 66ed6f6 commit. I mainly followed your LAMMPS interface code, and it seems to work for commands like cv getnumatoms, but cv reset seems broken. What do I need to do to support cv reset? Could you help me take a look at the code if you have time?

@giacomofiorin (Member) commented:
> @giacomofiorin I have tried to implement the scripting in the CudaGlobalMaster interface in the 66ed6f6 commit. I mainly followed your LAMMPS interface code, and it seems to work for commands like cv getnumatoms, but cv reset seems broken. What do I need to do to support cv reset? Could you help me take a look at the code if you have time?

Absolutely! Reaching out via chat.

HanatoK added 5 commits March 27, 2025 17:02
The colvarmodule object should only be created once, and setup() is called after reloading with "cv configfile".
This should enable the Lepton support and also find the TCL headers correctly.
This interface does not run Colvars on PE 0 as the traditional GlobalMaster interface does, so it is dangerous to use the same TCL interpreter that ScriptTcl (on PE 0) owns.