[RFC] Example use of NAMD CudaGlobalMaster interface #783
Conversation
I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (1,189 atoms each) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used, and the GPU was a laptop RTX 3060.
Another issue: it is a bit strange that this CudaGlobalMaster plugin loads the
This commit uses a custom allocator for the containers of positions, applied forces, total forces, masses and charges. The custom allocator ensures that these vectors are allocated in host-pinned memory, so that the CUDA transpose kernels can transpose and copy the data from the GPU directly into them, which reduces data movement.
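For reference, a minimal sketch of what such a pinned-memory allocator can look like, assuming the CUDA runtime API; this is an illustration, not the exact code from the commit:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Minimal STL-compatible allocator that places the vector storage in
// page-locked (pinned) host memory, so device-to-host copies can use DMA
// without an intermediate staging buffer.
template <typename T>
struct PinnedHostAllocator {
  using value_type = T;

  PinnedHostAllocator() = default;
  template <typename U>
  PinnedHostAllocator(const PinnedHostAllocator<U>&) {}

  T* allocate(std::size_t n) {
    void* ptr = nullptr;
    if (cudaMallocHost(&ptr, n * sizeof(T)) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(ptr);
  }

  void deallocate(T* ptr, std::size_t) noexcept {
    cudaFreeHost(ptr);
  }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return false; }

// Example container type for positions, forces, masses, or charges:
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```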
With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the fix of dlopen in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent to me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.
Use Cuda Allocator
The benchmark above does not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (also with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:
AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000
Intel(R) Xeon(R) Platinum 8168 CPU + 2x Tesla V100-SXM3-32GB
ARM Neoverse-V2 + NVIDIA GH200
Impressive results! Do you see a path to making this work transparently in specialized, static builds of Colvars/NAMD, or are there lasting reasons that this needs to be a dynamic library?
No. This implementation uses a new interface (I call it CudaGlobalMaster) which is independent of the old GlobalMaster interface. The GlobalMaster interface was designed more than twenty years ago, when there was no GPU computing. To be compatible with GlobalMaster, the GPU-resident code path has to copy all atoms from GPU memory to the SOA buffers of the patches, and then the
I totally agree with @jhenin! Many of us thought that much of the slowdown when running NAMD GPU-resident + Colvars came from the way data is copied (which is inherited from the constraints of GlobalMaster, an almost 30-year-old piece of code). But I am still amazed that you managed to get such a speedup without even touching the (slow and inefficient) code in the Colvars library. Absolutely impressive indeed! If you think that dynamic linkage is absolutely required, we could work with it: you just demonstrated very clearly the value of supporting CUDAGlobalMaster. That said, in making your considerations please also factor in that non-static executables take more work to install and maintain, either by the users or by their support staff. Academic institutions have historically had a difficult time recruiting good sysadmins, and (at least in the US) this is getting even harder lately :-(
Thanks for your comments! This is a plugin-like library. It contains all the Colvars code symbols itself, and NAMD loads it dynamically. In other words, it is dynamically loaded but not dynamically linked, and it should be targeted at a specific version of NAMD. If NAMD users don't use Colvars, then they don't need to load this library. This is similar to PLUMED, where you dynamically load
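To illustrate the "dynamically loaded, not dynamically linked" distinction in general terms, a host program can pull in such a plugin at run time with dlopen/dlsym along these lines; the library path and the create_client symbol below are placeholders for illustration, not the actual NAMD/CudaGlobalMaster API:

```cpp
#include <dlfcn.h>
#include <cstdio>

int main() {
  // Load the plugin at run time; the host binary has no link-time dependency
  // on it, so users who do not need it never load it.
  void* handle = dlopen("./libcudaglobalmastercolvars.so", RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  // Look up a factory entry point exported by the plugin.
  // "create_client" is a made-up symbol name for this sketch.
  using create_fn = void* (*)();
  auto create_client = reinterpret_cast<create_fn>(dlsym(handle, "create_client"));
  if (!create_client) {
    std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }

  void* client = create_client();  // the host talks to the plugin through this object
  (void)client;

  dlclose(handle);
  return 0;
}
```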
feature: support NAMD_TCL
Thanks for the explanations @HanatoK. I'm sure sysadmins would manage to build this if they had to, but if there is any way at all we can maintain the out-of-the-box, seamless Colvars experience, I think that would be a huge benefit to our users. Right now they have inputs that run seamlessly with GlobalMaster. If they could use the same input and the official binary from Illinois and get the performance that you just unlocked, that would be just awesome. I'm willing to put my own time and effort into this if there is any chance to make it happen.
Thanks! This interface is still preliminary. I just partially completed the TCL integration today and I will need to test it. I still don't know how
Likewise from me :-) Having Colvars in all official builds reaches a fairly large user base, which has evolved to rely on it (often in not so visible ways). Besides the several tutorials that exist around, CHARMM-GUI implicitly relies on Colvars when producing NAMD input decks for most membrane systems. There is a high chance that a typical NAMD user will also become a Colvars user at some point.
Here it looks like something we talked about earlier could help. Contrary to the early days, it makes less and less sense to derive
Again, I am afraid that since gpuGlobalCreateClient ../build/libcudaglobalmastercolvars.so COLVARS opes.colvars creates a client instance, set result [gpuGlobalUpdateClient COLVARS xxx yyy zzz] will pass the
I think when we say Colvars scripting there are essentially two "directions":
The only shared code for both interfaces seems to be the updating of masses and charges, the PDB readers, the setting of the simulation temperature, and the I/O streams.
That is absolutely true: I most certainly do not want to suggest that those two wildly different classes should have a shared API 😄
Then, if we agree on removing the inheritance, could
I am not sure how removing the inheritance works. There still should be a class derived from
This branch contains a commit that makes
It passes all tests, minus the ones related to the volmaps: if I can't fix those with a bit more time, it's not a deal breaker for me, because my preference would be discontinuing that code path altogether (in #737 and later work).
commit b589c9720ac8163a025bf05a475035f8bcd72f89
Author: HanatoK <summersnow9403@gmail.com>
Date:   Thu Mar 27 15:59:38 2025 -0500

    feat: forward updateFromTCLCommand to Colvars scripting interface
@giacomofiorin I have tried to implement the scripting in the CudaGlobalMaster interface in the 66ed6f6 commit. I mainly followed your LAMMPS interface code, and it seems to be working for commands like
Absolutely! Reaching out via chat. |
The colvarmodule object should only be created once, and setup() is called after reloading with "cv configfile".
This should enable the Lepton support and also find the TCL headers correctly.
This interface does not run Colvars on PE 0 as the traditional GlobalMaster interface does, so it is dangerous to use the same TCL interpreter that ScriptTcl (on PE 0) owns.
Hi @giacomofiorin and @jhenin! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations the implementation still has to copy the data from GPU to CPU, but I think it is a first step towards making Colvars GPU-resident (or at least a test bed for the GPU porting).
Compilation
To test the new interface, you need to build it together with Colvars as a shared library, using the CMake project in namd_cudaglobalmaster/CMakeLists.txt.
Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is broken; you need to switch to the fix_cudagm_reduction branch to build the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).
Example usage
The example NAMD input file can be found in namd_cudaglobalmaster/example/alad.namd, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.

Limitations

- init_atom and clear_atom;
- colvarproxy_namd.*, but I am still not sure why some functions like update_target_temperature(), update_engine_parameters(), setup_input() and setup_output() seem to be called multiple times there;
- xxxyyyzzz format as discussed in GPU preparation work #652. However, Colvars still uses xyzxyzxyz, so I have to transform the arrays in the interface code (see the sketch after this list);
- namd_cudaglobalmaster/CMakeLists.txt should add Colvars by add_subdirectory instead of finding all source files directly.
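For clarity on the array-layout point above, here is a minimal sketch of the kind of conversion the interface has to perform; the function names and buffer arguments are placeholders for illustration, not the actual interface code:

```cpp
#include <cstddef>

// Convert from the blocked layout (xxx...yyy...zzz, one block per coordinate)
// to the interleaved xyzxyzxyz layout that Colvars currently expects, for n atoms.
void blocked_to_interleaved(const double* src, double* dst, std::size_t n) {
  const double* x = src;          // first n values: all x coordinates
  const double* y = src + n;      // next n values: all y coordinates
  const double* z = src + 2 * n;  // last n values: all z coordinates
  for (std::size_t i = 0; i < n; ++i) {
    dst[3 * i + 0] = x[i];
    dst[3 * i + 1] = y[i];
    dst[3 * i + 2] = z[i];
  }
}

// The opposite direction is needed when sending applied forces back.
void interleaved_to_blocked(const double* src, double* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i]         = src[3 * i + 0];
    dst[n + i]     = src[3 * i + 1];
    dst[2 * n + i] = src[3 * i + 2];
  }
}
```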