Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise cmake option default values for AMD GPUs #5339

Merged
merged 3 commits into from
Feb 24, 2025

Conversation

ye-luo
Copy link
Contributor

@ye-luo ye-luo commented Feb 24, 2025

Proposed changes

QMC_DISABLE_HIP_HOST_REGISTER default to OFF
QMC_OFFLOAD_MEM_ASSOCIATED default ON.

QMC_OFFLOAD_MEM_ASSOCIATED is clearly required. Otherwise hipMemcpyAsyc 10x slower.

In the following study, I investigated our cmake option QMC_DISABLE_HIP_HOST_REGISTER
and a libomptarget environment variable LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0. This option doesn't behave exactly as its name indicates. For transfer smaller than specified size, the runtime uses a staging buffer for transfer which needs an extra copy from the source to the staging buffer. I don't see benefit of this "optimizatoin". setting 0 stops this code path.

cat timing.out 
                    w1792    DMC  Particle  Jastrow  Det  SPO  Prod
input                a8      10.1874 1.5654 0.5169 2.8677 1.3319 9.626
input-hostreg        a8      10.0752 1.5893 0.505 2.2743 1.4611 9.4761
input-async0         a8       8.3971 1.3029 0.4825 2.5875 1.1267 7.8679
input-hostreg-async0 a8       8.8200 1.5495 0.5802 1.7928 1.3616 7.7407
                    w1792    DMC  Particle  Jastrow  Det  SPO  Prod
input                a16     30.5002 4.8787 1.4732 9.1898 4.1123 29.224
input-hostreg        a16     27.9757 5.0044 1.3626 3.3442 5.2758 26.6908
input-async0         a16     19.8995 3.1357 1.2214 6.3096 2.7755 18.8107
input-hostreg-async0 a16     18.4683 3.4287 1.2619 3.2591 3.3108 16.8927
                   w1792     DMC  Particle  Jastrow  Det  SPO  Prod
input                a32     67.6461 7.3679 2.649 26.9637 10.9005 63.8135
input-hostreg        a32     56.3777 8.5321 2.8927 8.9338 13.5435 52.3292
input-async0         a32     58.3469 7.043 2.5035 22.8646 8.7776 55.1875
input-hostreg-async0 a32     47.3078 7.5427 2.6123 7.8112 10.5371 43.5157
                   w896      DMC  Particle  Jastrow  Det  SPO  Prod
input                a64    102.0779 11.8075 3.3752 38.0492 20.5291 95.1362
input-hostreg        a64     94.9863 15.3319 3.6487 18.5188 23.5873 87.6202
input-async0         a64     89.8632 11.2825 3.3091 31.2121 15.9242 84.1878
input-hostreg-async0 a64     80.7440 12.5845 3.2412 15.5435 18.9268 74.7883
                     w224    DMC  Particle  Jastrow  Det  SPO  Prod
input                a128    177.2274 24.1321 3.3008 51.4104 29.3939 170.246
input-hostreg        a128    177.4008 30.2973 3.865 24.9557 36.4347 169.769
input-async0         a128    103.7132 16.904 3.1309 35.6832 17.6886 97.9812
input-hostreg-async0 a128    103.0781 19.1998 3.1908 21.1487 22.0529 96.9672
                    w168     DMC  Particle  Jastrow  Det  SPO  Prod
input                a256    320.8476 42.8842 6.3905 136.066 53.3354 300.395
input-hostreg        a256    337.4341 66.1535 8.2798 67.8192 83.7308 313.366
input-async0         a256    282.2445 41.8899 5.4026 116.024 45.3034 265.753
input-hostreg-async0 a256    290.3307 54.6119 7.3857 65.5321 62.5543 271.046

Tested with rocm 6.3.1 and amdgpu 6.10.5 on OLCF Frontier MI250X.

What type(s) of changes does this code introduce?

  • Build related changes

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

frontier

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'

@ye-luo ye-luo changed the title Revise cmake default for AMD GPUs Revise cmake option default values for AMD GPUs Feb 24, 2025
@ye-luo
Copy link
Contributor Author

ye-luo commented Feb 24, 2025

Test this please

@prckent
Copy link
Contributor

prckent commented Feb 24, 2025

Interesting to see how critical these settings continue to be - a 10-70% speedup obtained over the worst settings, depending on NiO problem size and walker count.

Do we have clear guidance on the helper threads situation? These seem less likely to help, but do we have current data?

@prckent prckent self-requested a review February 24, 2025 15:03
@prckent
Copy link
Contributor

prckent commented Feb 24, 2025

Please add a note on which version of ROCm was tested, so we have a record here.

@prckent prckent merged commit a56e5c8 into QMCPACK:develop Feb 24, 2025
43 checks passed
@prckent
Copy link
Contributor

prckent commented Feb 24, 2025

Also: Thoughts on updating frontier build scripts with LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0 + notes about it, or do you want to do more investigation first? The build scripts should print a reminder of important environment settings.

@ye-luo
Copy link
Contributor Author

ye-luo commented Feb 24, 2025

Interesting to see how critical these settings continue to be - a 10-70% speedup obtained over the worst settings, depending on NiO problem size and walker count.

Do we have clear guidance on the helper threads situation? These seem less likely to help, but do we have current data?

I played with helper threads and it didn't show performance impact. My recommendation remains unchanged. There is no need to open this can of warms.

@ye-luo ye-luo deleted the revise-cmake-default-for-amd branch February 24, 2025 17:19
@ye-luo
Copy link
Contributor Author

ye-luo commented Feb 24, 2025

Also: Thoughts on updating frontier build scripts with LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0 + notes about it, or do you want to do more investigation first? The build scripts should print a reminder of important environment settings.

I'd like to write such info in our user manual instead of machine specific scripts.

@prckent prckent mentioned this pull request Feb 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants