[RFC] Example use of NAMD CudaGlobalMaster interface #783

Draft · wants to merge 22 commits into master
Conversation

@HanatoK (Member) commented Mar 20, 2025

Hi @giacomofiorin and @jhenin! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations, the implementation still has to copy the data from the GPU to the CPU, but I think it is a first step toward making Colvars GPU-resident (or at least a test bed for the GPU porting).

Compilation

To test the new interface, you need to run the following commands to build the interface with Colvars as a shared library:

```
cd namd_cudaglobalmaster/
mkdir build
cd build
cmake -DNAMD_DIR=<YOUR_NAMD_SOURCE_CODE_DIRECTORY> ../
make -j2
```

Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is broken, so you need to switch to the fix_cudagm_reduction branch to build both the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).

Example usage

The example NAMD input file can be found in namd_cudaglobalmaster/example/alad.namd, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.

Limitations

  • Colvars does not provide any interface to notify the MD engine that it has finished changing the atom selection, so I have to reallocate all buffers whenever init_atom or clear_atom is called;
  • Most of the interface code is copied from the existing colvarproxy_namd.*, but I am still not sure why some functions such as update_target_temperature(), update_engine_parameters(), setup_input() and setup_output() seem to be called multiple times there;
  • CudaGlobalMaster copies the atoms to the buffers in xxxyyyzzz format, as discussed in GPU preparation work #652. However, Colvars still uses xyzxyzxyz, so I have to transform the arrays in the interface code (see the sketch after this list);
  • SMP is disabled, as it conflicts with the goal of GPU residency;
  • volmap is not available;
  • More tests are needed;
  • The build file namd_cudaglobalmaster/CMakeLists.txt should add Colvars via add_subdirectory instead of collecting all source files directly.
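For illustration, the layout transform mentioned in the third bullet amounts to something like the following minimal CPU-side sketch (the actual interface does this with CUDA transpose kernels; the function and variable names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Convert an xxx...yyy...zzz (structure-of-arrays) buffer, as provided by
// CudaGlobalMaster, into the xyzxyzxyz (array-of-structures) layout that
// Colvars currently expects.
void soa_to_aos(const std::vector<double> &soa, std::vector<double> &aos,
                std::size_t num_atoms) {
  aos.resize(3 * num_atoms);
  for (std::size_t i = 0; i < num_atoms; ++i) {
    aos[3 * i + 0] = soa[i];                 // x block
    aos[3 * i + 1] = soa[num_atoms + i];     // y block
    aos[3 * i + 2] = soa[2 * num_atoms + i]; // z block
  }
}
```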

@HanatoK (Member, Author) commented Mar 21, 2025

I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (each involving 1,189 atoms) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used, and the GPU was a laptop RTX 3060.

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 62.9669 |
| Colvars with this interface | 62.4366 |
| Colvars with GlobalMaster | 57.365 |

@HanatoK (Member, Author) commented Mar 22, 2025

Another issue:

It is a bit strange that this CudaGlobalMaster plugin loads the colvarproxy-related symbols from NAMD instead of from the Colvars source code compiled into it.

HanatoK added 3 commits March 24, 2025 13:16
This commit uses a custom allocator for the containers of positions, applied forces, total forces, masses and charges. The custom allocator ensures that the vectors are allocated in host-pinned memory, so that the CUDA transpose kernels can transpose and copy the data from the GPU directly to the host, which reduces data movement.
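As a rough sketch, a host-pinned allocator of this kind could look as follows, assuming the CUDA runtime API (the allocator actually used in this branch may differ in detail):

```cpp
#include <cstddef>
#include <new>
#include <vector>
#include <cuda_runtime.h>

// Minimal allocator that places std::vector storage in page-locked (pinned)
// host memory, so GPU kernels can write to it via asynchronous copies without
// an intermediate staging buffer.
template <typename T>
struct PinnedHostAllocator {
  using value_type = T;
  PinnedHostAllocator() = default;
  template <typename U> PinnedHostAllocator(const PinnedHostAllocator<U> &) {}
  T *allocate(std::size_t n) {
    void *ptr = nullptr;
    if (cudaMallocHost(&ptr, n * sizeof(T)) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T *>(ptr);
  }
  void deallocate(T *p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T> &, const PinnedHostAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T> &, const PinnedHostAllocator<U> &) { return false; }

// Example: position buffers stored in pinned host memory.
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```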
@HanatoK (Member, Author) commented Mar 24, 2025

With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the dlopen fix in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 32.6442 |
| Colvars with this interface | 27.1574 |
| Colvars with GlobalMaster | 14.2476 |

@HanatoK (Member, Author) commented Mar 25, 2025

> With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the dlopen fix in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.
>
> | Setup | Simulation speed (ns/day) |
> | --- | --- |
> | No Colvars | 32.6442 |
> | Colvars with this interface | 27.1574 |
> | Colvars with GlobalMaster | 14.2476 |

The benchmark above did not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (along with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:

AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 27.8499 |
| Colvars with CudaGlobalMaster | 26.871 |
| Colvars with GlobalMaster | 13.9529 |

Intel(R) Xeon(R) Platinum 8168 CPU + 2xTesla V100-SXM3-32GB

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 54.1266 |
| Colvars with CudaGlobalMaster | 38.1852 |
| Colvars with GlobalMaster | 7.22716 |

ARM Neoverse-V2 + NVIDIA GH200

| Setup | Simulation speed (ns/day) |
| --- | --- |
| No Colvars | 78.6648 |
| Colvars with CudaGlobalMaster | 67.9719 |
| Colvars with GlobalMaster | 22.5077 |

@jhenin (Member) commented Mar 26, 2025

Impressive results! Do you see a path to making this work transparently in specialized, static builds of Colvars/NAMD, or are there lasting reasons that this needs to be a dynamic library?

@HanatoK (Member, Author) commented Mar 26, 2025

> Impressive results! Do you see a path to making this work transparently in specialized, static builds of Colvars/NAMD, or are there lasting reasons that this needs to be a dynamic library?

No. This implementation uses a new interface (which I call CudaGlobalMaster) that is independent of the old GlobalMaster interface. The GlobalMaster interface was designed more than twenty years ago, when there was no GPU computing. To stay compatible with GlobalMaster, the GPU-resident code path has to copy all atoms from GPU memory to the SOA buffers of the patches; the ComputeGlobal objects then convert them to AOS buffers and send the aggregated atoms to GlobalMasterServer via a Charm++ message, and finally GlobalMasterServer copies the requested atoms to Colvars (as a client derived from GlobalMaster). As you can see, the data copying is massive and very indirect. I am not sure how to improve it without breaking many things like the CPU MPI build.

CudaGlobalMaster is specialized for the GPU-resident mode: it only copies the atoms that are requested by clients, and copies them only once (for Colvars it has to be twice, because Colvars requires an extra GPU-to-CPU copy). Dynamic loading is more flexible because the plugins can link to any third-party libraries they want. If I used dynamic/static linking for the interface, then clients that use PyTorch or TensorFlow would force NAMD to link against them as well, which would make NAMD nearly impossible to distribute.

@giacomofiorin (Member) commented:
I totally agree with @jhenin!

Many of us thought that much of the slowdown when running NAMD GPU-resident + Colvars came from the way data is copied (which is inherited from the constraints of GlobalMaster, an almost 30-year-old piece of code). But I am still amazed that you managed to get such a speedup without even touching the (slow and inefficient) code in the Colvars library. Absolutely impressive indeed!

If you think that dynamic linkage is absolutely required, we could work with it: you just demonstrated very clearly the value of supporting CUDAGlobalMaster.

That said, in making your considerations please also factor in that non-static executables take more work to install and maintain, either by the users or by their support staff. Academic institutions have historically had a difficult time recruiting good sysadmins, and (at least in the US) this is getting even harder lately :-(

@HanatoK (Member, Author) commented Mar 26, 2025

> I totally agree with @jhenin!
>
> Many of us thought that much of the slowdown when running NAMD GPU-resident + Colvars came from the way data is copied (which is inherited from the constraints of GlobalMaster, an almost 30-year-old piece of code). But I am still amazed that you managed to get such a speedup without even touching the (slow and inefficient) code in the Colvars library. Absolutely impressive indeed!
>
> If you think that dynamic linkage is absolutely required, we could work with it: you just demonstrated very clearly the value of supporting CUDAGlobalMaster.
>
> That said, in making your considerations please also factor in that non-static executables take more work to install and maintain, either by the users or by their support staff. Academic institutions have historically had a difficult time recruiting good sysadmins, and (at least in the US) this is getting even harder lately :-(

Thanks for your comments! This is a plugin-like library. It contains all the Colvars code symbols itself, and NAMD loads it dynamically. In other words, it is dynamically loaded but not dynamically linked, and it should be targeted at a specific version of NAMD. If NAMD users don't use Colvars, they don't need to load this library. This is similar to PLUMED, where you dynamically load libplumedKernel.so. Since PLUMED has a much larger user base than Colvars and many sysadmins already know how to maintain PLUMED, I think it would not be difficult to maintain a plugin like this.

@jhenin (Member) commented Mar 26, 2025

Thanks for the explanations, @HanatoK. I'm sure sysadmins would manage to build this if they had to, but if there is any way at all we can maintain the out-of-the-box, seamless Colvars experience, I think that would be a huge benefit to our users. Right now they have inputs that run seamlessly with GlobalMaster. If they could use the same input and the official binary from Illinois and get the performance that you just unlocked, that would be just awesome. I'm willing to put my own time and effort into this if there is any chance to make it happen.

@HanatoK (Member, Author) commented Mar 26, 2025

> Thanks for the explanations, @HanatoK. I'm sure sysadmins would manage to build this if they had to, but if there is any way at all we can maintain the out-of-the-box, seamless Colvars experience, I think that would be a huge benefit to our users. Right now they have inputs that run seamlessly with GlobalMaster. If they could use the same input and the official binary from Illinois and get the performance that you just unlocked, that would be just awesome. I'm willing to put my own time and effort into this if there is any chance to make it happen.

Thanks! This interface is still preliminary. I have only partially completed the TCL integration today, and I will need to test it. I still don't know how colvarscript works, and I don't even know whether colvarscript requires TCL or not. On the CudaGlobalMaster interface side, I have implemented virtual std::string updateFromTCLCommand(const std::vector<std::string>& arguments), which accepts arguments between runs and allows the client to do anything it wants. In other words, I expect Colvars to have a general scripting interface like int cvscript_run(int argc, char* argv[]).
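For illustration, such an updateFromTCLCommand() override could forward its arguments roughly like the free function below (a sketch only: cvscript_run() is merely declared here as the assumed entry point, and the result handling is simplified):

```cpp
#include <string>
#include <vector>

// Assumed argc/argv-style scripting entry point, as described above.
int cvscript_run(int argc, char *argv[]);

// Hypothetical bridge: forward the TCL arguments received from NAMD to the
// Colvars scripting interface.
std::string forward_to_colvars_script(const std::vector<std::string> &arguments) {
  std::vector<char *> argv;
  argv.reserve(arguments.size());
  for (const std::string &arg : arguments) {
    argv.push_back(const_cast<char *>(arg.c_str()));
  }
  const int err = cvscript_run(static_cast<int>(argv.size()), argv.data());
  // A real updateFromTCLCommand() would also fetch the result text from the
  // Colvars scripting object; here only success or failure is reported.
  return (err == 0) ? std::string("OK")
                    : "Colvars script error code " + std::to_string(err);
}
```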

@giacomofiorin (Member) commented:
> I'm willing to put my own time and effort into this if there is any chance to make it happen.

Likewise from me :-)

Having Colvars in all official builds reaches a fairly large user base, which has evolved to rely on it (often in not-so-visible ways). Besides the several tutorials available, CHARMM-GUI implicitly relies on Colvars when producing NAMD input decks for most membrane systems. There is a high chance that a typical NAMD user will also become a Colvars user at some point.

> Thanks! This interface is still preliminary. I have only partially completed the TCL integration today, and I will need to test it. I still don't know how colvarscript works, and I don't even know whether colvarscript requires TCL or not. On the CudaGlobalMaster interface side, I have implemented virtual std::string updateFromTCLCommand(const std::vector<std::string>& arguments), which accepts arguments between runs and allows the client to do anything it wants. In other words, I expect Colvars to have a general scripting interface like int cvscript_run(int argc, char* argv[]).

Here it looks like something we talked about earlier could help. Contrary to the early days, it makes less and less sense to derive colvarproxy_namd from GlobalMaster. Even for a plugin implementation, it would be better to consolidate what is common as much as possible and have a more abstract interface that supports both GlobalMaster and CUDAGlobalMaster (based on the availability of the code and the user's input).

@HanatoK (Member, Author) commented Mar 27, 2025

> Here it looks like something we talked about earlier could help. Contrary to the early days, it makes less and less sense to derive colvarproxy_namd from GlobalMaster. Even for a plugin implementation, it would be better to consolidate what is common as much as possible and have a more abstract interface that supports both GlobalMaster and CUDAGlobalMaster (based on the availability of the code and the user's input).

Again, I am afraid that, since CUDAGlobalMaster greatly differs from GlobalMaster, it is not very meaningful to have an abstract interface. CUDAGlobalMaster does not support SMP, and it calls clients on a specific PE (the master PE) that controls the GPU device, which may not be PE 0. The scripting interface (ScriptTcl) of NAMD only supports PE 0. To support scripting of any clients, I have added ScriptTcl::Tcl_gpuGlobalUpdateClient, which basically broadcasts the TCL arguments to all PEs, and only the master PE will receive the arguments. More specifically, in namd_cudaglobalmaster/example/alad.namd of this PR, the line

```
gpuGlobalCreateClient ../build/libcudaglobalmastercolvars.so COLVARS opes.colvars
```

creates a client instance COLVARS, and the NAMD TCL command

```
set result [gpuGlobalUpdateClient COLVARS xxx yyy zzz]
```

will pass xxx yyy zzz as a std::vector<std::string> to the COLVARS instance and expect a std::string result.

I think when we say "Colvars scripting" there are essentially two "directions":

  1. Colvars calls some scripts to compute the CVs and biases. I guess this can be solved by simply using set_tcl_interp;
  2. A scripting language calls Colvars. In that case it seems that calling proxy->script->run is enough, but I am not sure, and this is not how the traditional GlobalMaster interface works.

The only code shared between both interfaces seems to be the updating of masses and charges, the PDB readers, the setting of the simulation temperature, and the I/O streams.

@giacomofiorin (Member) commented:
> Again, I am afraid that, since CUDAGlobalMaster greatly differs from GlobalMaster, it is not very meaningful to have an abstract interface.

That is absolutely true: I most certainly do not want to suggest that those two wildly different classes should have a shared API 😄

@jhenin (Member) commented Mar 27, 2025

Then, if we agree on removing the inheritance, could colvarproxy_namd have a GlobalMaster and a CUDAGlobalMaster member, and switch between them as appropriate?

@HanatoK (Member, Author) commented Mar 27, 2025

> Then, if we agree on removing the inheritance, could colvarproxy_namd have a GlobalMaster and a CUDAGlobalMaster member, and switch between them as appropriate?

I am not sure how removing the inheritance would work. There would still need to be a class derived from GlobalMaster to make Colvars work with the GlobalMaster interface. Could you give me more details about your plan?

@giacomofiorin (Member) commented:
> I am not sure how removing the inheritance would work. There would still need to be a class derived from GlobalMaster to make Colvars work with the GlobalMaster interface. Could you give me more details about your plan?

This branch contains a commit that makes GlobalMasterColvars a real class that implements a thin wrapper around colvarproxy_namd (as opposed to the latter inheriting from the former).

It passes all tests except the ones related to the volmaps; if I can't fix those with a bit more time, it's not a deal breaker for me, because my preference would be to discontinue that code path altogether (in #737 and later work).
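A rough sketch of the composition approach described above (the forwarding call and constructor details are hypothetical; only the idea of GlobalMasterColvars owning a colvarproxy_namd instead of colvarproxy_namd inheriting from GlobalMaster is taken from this thread):

```cpp
#include <memory>

#include "GlobalMaster.h"      // NAMD's GlobalMaster base class
#include "colvarproxy_namd.h"  // the Colvars proxy for NAMD

// Thin wrapper: GlobalMasterColvars owns a colvarproxy_namd instance and
// forwards the GlobalMaster callback to it.
class GlobalMasterColvars : public GlobalMaster {
public:
  GlobalMasterColvars() : proxy_(std::make_unique<colvarproxy_namd>()) {}

protected:
  void calculate() override {
    // Hypothetical forwarding call: hand the requested atoms to the proxy,
    // let Colvars compute the biases, then read back the applied forces.
    proxy_->calculate();
  }

private:
  std::unique_ptr<colvarproxy_namd> proxy_;  // all Colvars logic stays in the proxy
};
```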

commit b589c9720ac8163a025bf05a475035f8bcd72f89
Author: HanatoK <summersnow9403@gmail.com>
Date:   Thu Mar 27 15:59:38 2025 -0500

    feat: forward updateFromTCLCommand to Colvars scripting interface
@HanatoK (Member, Author) commented Mar 27, 2025

@giacomofiorin I have tried to implement the scripting in the CudaGlobalMaster interface in the 66ed6f6 commit. I mainly followed your LAMMPS interface code, and it seems to work for commands like cv getnumatoms, but cv reset seems broken. What do I need to do to support cv reset? Could you help me take a look at the code if you have time?

@giacomofiorin (Member) commented:
> @giacomofiorin I have tried to implement the scripting in the CudaGlobalMaster interface in the 66ed6f6 commit. I mainly followed your LAMMPS interface code, and it seems to work for commands like cv getnumatoms, but cv reset seems broken. What do I need to do to support cv reset? Could you help me take a look at the code if you have time?

Absolutely! Reaching out via chat.

HanatoK added 5 commits March 27, 2025 17:02
The colvarmodule object should only be created once, and setup() is called after reloading with "cv configfile".
This should enable the Lepton support and also find the TCL headers correctly.
This interface does not run Colvars on PE 0 as the traditional GlobalMaster interface does, so it is dangerous to use the same TCL interpreter that ScriptTcl (on PE 0) owns.