Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LCAO GPU Optimization V1 reduce device pointer look up (OpenMP target map) #5342

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

anbenali
Copy link
Contributor

@anbenali anbenali commented Feb 26, 2025

Proposed changes

First set of optimizations allowing for a 2.2X speed up by reducing device pointer lookup via OpenMP.

The LCAO branch was about 3 to 4X slower than CPU. After investigation, It seemed like it was due to map(to:) not checking if data was already on device. So converted all calls to is_device_ptr which fixed the issue.

Tested on multiple systems, but more specifically "TOU ASN" molecule: O(3) S(1) C(8) N(5) H(17), with 106 electrons and 1185 basis functions.

What type(s) of changes does this code introduce?

Code Optimization

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with current the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • No. Documentation has been added (if appropriate)
    throughput_comparison

…nnecessary transfers due to map(to) instead of using is_device_ptr
Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo changes to dev_ptr so this PR can focus on real optimization.

@prckent
Copy link
Contributor

prckent commented Feb 26, 2025

Can you please add some brief specifics on what the CPU/GPU comparison was? e.g. Processor, GPU, which molecule (basis, electron count), batch size?

@anbenali
Copy link
Contributor Author

dev_ptr

I am not understanding this... What do you want me to do?

@ye-luo
Copy link
Contributor

ye-luo commented Feb 27, 2025

dev_ptr

I am not understanding this... What do you want me to do?

See my comment on the source code https://github.com/QMCPACK/qmcpack/pull/5342/files#r1972521897

@anbenali
Copy link
Contributor Author

Can you please add some brief specifics on what the CPU/GPU comparison was? e.g. Processor, GPU, which molecule (basis, electron count), batch size?

Preparing file with summary.

@anbenali anbenali changed the title [WIP] LCAO Optimization LCAO GPU Optimization V1 Feb 27, 2025
@ye-luo
Copy link
Contributor

ye-luo commented Feb 28, 2025

I profiled runs with and without this change. This optimization pattern works cross platform (nvidia/intel checked)
old

auto* ptr = a.data(); // a is a dual space container.
#pragma target map(ptr)
{}

proposed

auto* dev_ptr = a.data(); // a is a dual space container.
#pragma target is_device_ptr(dev_ptr)
{}

the difference is that the old code needs a runtime table lookup to find out the dev_ptr and it requires mutex locking the table during lookup and it is a bottleneck. When threads doesn't do the look up frequently, the cost is negligible.

The reason of slow LCAO was its small kernels, high call counts nature.
https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCWaveFunctions/LCAO/SoaLocalizedBasisSet.cpp#L251
The basis computation goes through atoms one by one. It should be changed eventually to group the computation by atomic species.

The proposed way does have a drawback. Device pointers are exposed in the host code. This may cased segfault if they are not managed correctly, for example de-referenced on the host by accident.

For the moment, I think we can take this PR that uses device ptr. Eventually once we change the code computing basis species by species. We can restore the code using the old code pattern.

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only keep changes of switching to dev_ptr and run clang-format.

@@ -213,6 +213,7 @@ class SoaLocalizedBasisSet : public SoaBasisSetBase<ORBT>
struct SoaLocalizedBSetMultiWalkerMem;
/// multi walker resource handle
ResourceHandle<SoaLocalizedBSetMultiWalkerMem> mw_mem_handle_;
NewTimer & NumCenter_timer_ ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel the need of adding this timer. Those inside LOBasisSet[IonID[c]]->mw_evaluateVGL is probably enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave as I need it for the next one and let's not be pedantic on one timer I really need for the next PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is needed for the next one, it is better to be included in the next one. We keep PRs focusing on self contained topics.

@anbenali
Copy link
Contributor Author

Fantastic!!! Thanks Ye.
Will clean asap.

However, I am almost half through isolating per basiset. Will start in a new PR.

@ye-luo
Copy link
Contributor

ye-luo commented Feb 28, 2025

Fantastic!!! Thanks Ye. Will clean asap.

However, I am almost half through isolating per basiset. Will start in a new PR.

Please make small PRs. large ones are too painful to review.

auto* dr_ptr = dr.data();
auto* r_ptr = r.data();
auto* correctphase_ptr = correctphase.device_data();
auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this line back to its original spot.


PRAGMA_OFFLOAD("omp target teams distribute parallel for map(to:SuperTwist_ptr[:SuperTwist.size()], \
Tv_list_ptr[3*nElec*center_idx:3*nElec], correctphase_ptr[:nElec]) ")
PRAGMA_OFFLOAD(" omp target teams distribute parallel for \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove space " omp

auto* dr_ptr = dr.data();
auto* r_ptr = r.data();
//Assuming These are correctly computed on Device
auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this line back to its original spot?

auto* LM_ptr = LM.data();
auto* NL_ptr = NL.data();
auto* psi_ptr = psi.data();
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Tv_list[idim + 3 * (iw + c * Nw)] = (ions_.R[c][idim] - coordR[idim]) - displ[c][idim];
displ_list_tr[idim + 3 * (iw + c * Nw)] = displ[c][idim];
}
for (size_t iw = 0; iw < P_list.size(); iw++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange { before this line.

@ye-luo ye-luo changed the title LCAO GPU Optimization V1 LCAO GPU Optimization V1 reduce device pointer look up (OpenMP target map) Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants