Revise cmake option default values for AMD GPUs #5339
Test this please
Interesting to see how critical these settings continue to be: a 10-70% speedup over the worst settings, depending on NiO problem size and walker count. Do we have clear guidance on the helper threads situation? These seem less likely to help, but do we have current data?
Please add a note on which version of ROCm was tested, so we have a record here.
Also: Thoughts on updating the Frontier build scripts with LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0 plus notes about it, or do you want to do more investigation first? The build scripts should print a reminder of important environment settings.
I played with helper threads and they showed no performance impact. My recommendation remains unchanged. There is no need to open this can of worms.
I'd like to put such info in our user manual instead of machine-specific scripts.
Proposed changes
QMC_DISABLE_HIP_HOST_REGISTER defaults to OFF.
QMC_OFFLOAD_MEM_ASSOCIATED defaults to ON.
QMC_OFFLOAD_MEM_ASSOCIATED is clearly required. Otherwise hipMemcpyAsync is 10x slower.
In the following study, I investigated our cmake option QMC_DISABLE_HIP_HOST_REGISTER
and the libomptarget environment variable LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES, set to 0. This variable doesn't behave exactly as its name indicates. For transfers smaller than the specified size, the runtime goes through a staging buffer, which requires an extra copy from the source to the staging buffer. I don't see the benefit of this "optimization". Setting the variable to 0 disables this code path.
Tested with ROCm 6.3.1 and amdgpu 6.10.5 on OLCF Frontier MI250X.
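For reference, the settings discussed above could be collected in a job or build script roughly as follows. This is a minimal sketch, not part of the PR: the helper name `print_offload_settings`, the variable `QMC_CMAKE_OPTS`, and the source/build paths in the commented configure line are illustrative assumptions. The other cmake options needed for an offload build are omitted.

```shell
#!/bin/sh
# Sketch only: names other than the two cmake options and the libomptarget
# environment variable are hypothetical.

# Setting this to 0 keeps small transfers off the staging-buffer code path.
export LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0

# The two options from this PR, spelled out explicitly so the build does not
# depend on which defaults the checked-out source happens to carry.
QMC_CMAKE_OPTS="-DQMC_OFFLOAD_MEM_ASSOCIATED=ON -DQMC_DISABLE_HIP_HOST_REGISTER=OFF"

# Hypothetical helper: print a reminder of the settings in effect.
print_offload_settings() {
  echo "cmake options: ${QMC_CMAKE_OPTS}"
  echo "LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=${LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES:-unset}"
}

print_offload_settings

# Example configure step (paths are placeholders, other required options omitted):
# cmake ${QMC_CMAKE_OPTS} -S /path/to/qmcpack -B build
```

Printing the values at the start of a run leaves a record in the job log, which is useful when comparing timings across different settings.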
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
frontier
Checklist