LCAO GPU Optimization V1 reduce device pointer look up (OpenMP target map) #5342

anbenali · 2025-02-26T22:22:38Z

Proposed changes

First set of optimizations allowing for a 2.2X speed up by reducing device pointer lookup via OpenMP.

The LCAO branch on GPU was about 3 to 4X slower than on CPU. After investigation, it seemed to be due to map(to:) not checking whether the data was already on the device. All such calls were converted to is_device_ptr, which fixed the issue.
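For readers unfamiliar with the two clauses, here is a minimal sketch of the difference. It is not code from this PR: the function names (`sum_with_map`, `sum_with_device_ptr`, `sums_agree`) and the test buffer are hypothetical, and the device address is obtained with `use_device_ptr` on an explicitly mapped buffer rather than from QMCPACK's dual space containers. When built without OpenMP offload, the pragmas are ignored and everything runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pre-change pattern: map(to:) hands the runtime a host pointer, so every
// launch has to find the matching device buffer in the runtime's mapping table.
double sum_with_map(const double* host_ptr, std::size_t n)
{
  double total = 0.0;
#pragma omp target teams distribute parallel for map(to : host_ptr[:n]) map(tofrom : total) reduction(+ : total)
  for (std::size_t i = 0; i < n; ++i)
    total += host_ptr[i];
  return total;
}

// Post-change pattern: is_device_ptr tells the runtime that dev_ptr already
// holds a device address, so it is passed to the kernel as-is.
double sum_with_device_ptr(const double* dev_ptr, std::size_t n)
{
  double total = 0.0;
#pragma omp target teams distribute parallel for is_device_ptr(dev_ptr) map(tofrom : total) reduction(+ : total)
  for (std::size_t i = 0; i < n; ++i)
    total += dev_ptr[i];
  return total;
}

// Maps a buffer once, captures its device address, and checks that both
// launch styles see the same data.
bool sums_agree(std::size_t n)
{
  assert(n > 0);
  std::vector<double> a(n, 0.5);
  double* host_ptr = a.data();

  // Transfer once; the buffer stays resident until the matching exit data.
#pragma omp target enter data map(to : host_ptr[:n])

  const double* dev_ptr = nullptr;
  // Inside this region host_ptr names the device address of the mapped buffer.
#pragma omp target data use_device_ptr(host_ptr)
  {
    dev_ptr = host_ptr;
  }

  const double s_map = sum_with_map(a.data(), n);
  const double s_dev = sum_with_device_ptr(dev_ptr, n);

#pragma omp target exit data map(delete : host_ptr[:n])
  return s_map == 0.5 * static_cast<double>(n) && s_dev == s_map;
}
```

Both functions compute the same sum. The only difference is whether the runtime has to translate a host pointer at launch time, which matters most when kernels are small and launched very often.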

Tested on multiple systems, most specifically on the "TOU ASN" molecule: O(3) S(1) C(8) N(5) H(17), with 106 electrons and 1185 basis functions.

What type(s) of changes does this code introduce?

Code Optimization

Does this introduce a breaking change?

No

What systems has this change been tested on?

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

Yes. This PR is up to date with the current state of 'develop'
Yes. Code added or changed in the PR has been clang-formatted
No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
No. Documentation has been added (if appropriate)

ye-luo

Please undo changes to dev_ptr so this PR can focus on real optimization.

src/Numerics/SoaCartesianTensor.h

src/QMCWaveFunctions/LCAO/MultiQuinticSpline1D.h

prckent · 2025-02-26T23:35:56Z

Can you please add some brief specifics on what the CPU/GPU comparison was? e.g. Processor, GPU, which molecule (basis, electron count), batch size?

anbenali · 2025-02-27T05:01:53Z

dev_ptr

I am not understanding this... What do you want me to do?

ye-luo · 2025-02-27T05:58:12Z

dev_ptr

I am not understanding this... What do you want me to do?

See my comment on the source code https://github.com/QMCPACK/qmcpack/pull/5342/files#r1972521897

src/QMCWaveFunctions/LCAO/MultiQuinticSpline1D.h

anbenali · 2025-02-27T08:18:53Z

Can you please add some brief specifics on what the CPU/GPU comparison was? e.g. Processor, GPU, which molecule (basis, electron count), batch size?

Preparing file with summary.

ye-luo · 2025-02-28T01:03:58Z

I profiled runs with and without this change. This optimization pattern works across platforms (checked on NVIDIA and Intel).
old

auto* ptr = a.data(); // a is a dual space container.
#pragma omp target map(to: ptr[:a.size()])
{}

proposed

auto* dev_ptr = a.device_data(); // a is a dual space container.
#pragma omp target is_device_ptr(dev_ptr)
{}

The difference is that the old code needs a runtime table lookup to find the device pointer, and the table must be locked with a mutex during the lookup, which makes it a bottleneck. When threads don't do the lookup frequently, the cost is negligible.
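As a rough illustration of that lookup cost, here is a toy model in plain C++. It is only a sketch under stated assumptions: `MappingTable`, `launch_with_map`, and `launch_with_device_ptr` are hypothetical names, and a real offload runtime's mapping table is more elaborate. Only the locking pattern is meant to carry over.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Toy stand-in for an offload runtime's host-to-device mapping table.
class MappingTable
{
public:
  // Records that `host` is mirrored at `device`.
  void add(const void* host, void* device)
  {
    assert(device != nullptr);
    std::lock_guard<std::mutex> guard(mutex_);
    table_[host] = device;
  }

  // Every lookup takes the lock, so launches issued concurrently from many
  // host threads serialize here.
  void* lookup(const void* host)
  {
    std::lock_guard<std::mutex> guard(mutex_);
    ++lookups_;
    auto it = table_.find(host);
    return it == table_.end() ? nullptr : it->second;
  }

  // Number of lookups performed so far.
  std::size_t lookups() const
  {
    std::lock_guard<std::mutex> guard(mutex_);
    return lookups_;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<const void*, void*> table_;
  std::size_t lookups_ = 0;
};

// "map" style launch: the host pointer is translated on every call.
void* launch_with_map(MappingTable& table, const void* host) { return table.lookup(host); }

// "is_device_ptr" style launch: the address is already known, so the table
// and its lock are never touched.
void* launch_with_device_ptr(void* device) { return device; }
```

In this model, N launches in the map style cost N locked lookups, while the device pointer style costs one lookup up front (or none, if the container already stores the device address).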

LCAO was slow because its kernels are small and are called a very large number of times.
https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCWaveFunctions/LCAO/SoaLocalizedBasisSet.cpp#L251
The basis computation goes through atoms one by one. It should be changed eventually to group the computation by atomic species.
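To make the grouping idea concrete, here is a small sketch in plain C++. It is not the planned implementation: `group_centers_by_species` and `make_species_list` are hypothetical helpers, and the species ids 0 to 4 are arbitrary labels standing in for what `IonID[c]` provides in the linked code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Groups center indices by species id, keeping center order within a group.
// species_of_center[c] plays the role of IonID[c].
std::map<int, std::vector<std::size_t>> group_centers_by_species(const std::vector<int>& species_of_center)
{
  std::map<int, std::vector<std::size_t>> groups;
  for (std::size_t c = 0; c < species_of_center.size(); ++c)
  {
    assert(species_of_center[c] >= 0);
    groups[species_of_center[c]].push_back(c);
  }
  return groups;
}

// Builds a species list from atom counts per species,
// e.g. {3, 1, 8, 5, 17} for O(3) S(1) C(8) N(5) H(17).
std::vector<int> make_species_list(const std::vector<std::size_t>& atoms_per_species)
{
  std::vector<int> species_of_center;
  for (std::size_t s = 0; s < atoms_per_species.size(); ++s)
    species_of_center.insert(species_of_center.end(), atoms_per_species[s], static_cast<int>(s));
  return species_of_center;
}
```

For the O(3) S(1) C(8) N(5) H(17) test molecule this gives 34 centers but only 5 species groups, so a loop over groups would be entered 5 times instead of 34, with each iteration handling more work.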

The proposed way does have a drawback: device pointers are exposed in the host code. This may cause segfaults if they are not managed correctly, for example if one is dereferenced on the host by accident.

For the moment, I think we can take this PR that uses device pointers. Eventually, once we change the code to compute the basis species by species, we can restore the old code pattern.

ye-luo

Please only keep changes of switching to dev_ptr and run clang-format.

src/QMCWaveFunctions/LCAO/SoaAtomicBasisSet.h

ye-luo · 2025-02-28T01:08:39Z

src/QMCWaveFunctions/LCAO/SoaLocalizedBasisSet.h

@@ -213,6 +213,7 @@ class SoaLocalizedBasisSet : public SoaBasisSetBase<ORBT>
  struct SoaLocalizedBSetMultiWalkerMem;
  /// multi walker resource handle
  ResourceHandle<SoaLocalizedBSetMultiWalkerMem> mw_mem_handle_;
+  NewTimer & NumCenter_timer_ ;


I don't see the need to add this timer. Those inside LOBasisSet[IonID[c]]->mw_evaluateVGL are probably enough.

I will leave it, as I need it for the next PR. Let's not be pedantic about one timer that I really need there.

If it is needed for the next one, it is better to include it in the next one. We keep PRs focused on self-contained topics.

anbenali · 2025-02-28T01:14:16Z

Fantastic!!! Thanks Ye.
Will clean up asap.

However, I am almost halfway through isolating per basis set. Will start that in a new PR.

ye-luo · 2025-02-28T01:20:25Z

Fantastic!!! Thanks Ye. Will clean up asap.

However, I am almost halfway through isolating per basis set. Will start that in a new PR.

Please make small PRs. Large ones are too painful to review.

ye-luo · 2025-02-28T20:26:08Z

src/QMCWaveFunctions/LCAO/SoaAtomicBasisSet.h

-    auto* dr_ptr = dr.data();
-    auto* r_ptr  = r.data();
+    auto* correctphase_ptr                 = correctphase.device_data();
+    auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data();


Could you move this line back to its original spot?

ye-luo · 2025-02-28T20:26:39Z

src/QMCWaveFunctions/LCAO/SoaAtomicBasisSet.h


-      PRAGMA_OFFLOAD("omp target teams distribute parallel for map(to:SuperTwist_ptr[:SuperTwist.size()], \
-		      Tv_list_ptr[3*nElec*center_idx:3*nElec], correctphase_ptr[:nElec]) ")
+      PRAGMA_OFFLOAD(" omp target teams distribute parallel for \


Remove the leading space in " omp.

ye-luo · 2025-02-28T20:28:11Z

src/QMCWaveFunctions/LCAO/SoaAtomicBasisSet.h

-    auto* dr_ptr = dr.data();
-    auto* r_ptr  = r.data();
+    //Assuming These are correctly computed on Device
+    auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data();


Could you move this line back to its original spot?

ye-luo · 2025-02-28T20:28:28Z

src/QMCWaveFunctions/LCAO/SoaAtomicBasisSet.h

-      auto* LM_ptr        = LM.data();
-      auto* NL_ptr        = NL.data();
-      auto* psi_ptr       = psi.data();
+      ///


ye-luo · 2025-02-28T20:29:08Z

src/QMCWaveFunctions/LCAO/SoaLocalizedBasisSet.cpp

-        Tv_list[idim + 3 * (iw + c * Nw)]       = (ions_.R[c][idim] - coordR[idim]) - displ[c][idim];
-        displ_list_tr[idim + 3 * (iw + c * Nw)] = displ[c][idim];
-      }
+    for (size_t iw = 0; iw < P_list.size(); iw++)


Strange { before this line.

First set of optimizations allowing for a 2.5X speed up by avoiding unnecessary transfers due to map(to) instead of using is_device_ptr

50ba440

ye-luo reviewed Feb 26, 2025

src/Numerics/SoaCartesianTensor.h

src/QMCWaveFunctions/LCAO/MultiQuinticSpline1D.h (outdated)

anbenali added 2 commits February 26, 2025 23:55

fix type for complex

46d65f3

Merge branch 'develop' into LCAO_Performance

b4411b3

ye-luo reviewed Feb 27, 2025

src/QMCWaveFunctions/LCAO/MultiQuinticSpline1D.h (outdated)

anbenali added 2 commits February 27, 2025 00:26

Fix for mixed precision

cb684c8

comment commentary...

355dba4

anbenali added 2 commits February 27, 2025 18:57

No performance addition from test

59572ae

Revert Optimization attempts as they did not show enough improvement compared to risks of memory corruption

c80b1c2

anbenali changed the title from "[WIP] LCAO Optimization" to "LCAO GPU Optimization V1" on Feb 27, 2025

ye-luo requested changes Feb 28, 2025

anbenali added 5 commits February 28, 2025 05:22

revert from dev_ptr

27db2ec

clang format

db1db01

few additional fixes

a2947de

clang format

9e54aa9

more cleaning

862497e

ye-luo reviewed Feb 28, 2025

ye-luo changed the title from "LCAO GPU Optimization V1" to "LCAO GPU Optimization V1 reduce device pointer look up (OpenMP target map)" on Feb 28, 2025
