-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LCAO GPU Optimization V1 reduce device pointer look up (OpenMP target map) #5342
base: develop
Are you sure you want to change the base?
Conversation
…nnecessary transfers due to map(to) instead of using is_device_ptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please undo changes to dev_ptr so this PR can focus on real optimization.
Can you please add some brief specifics on what the CPU/GPU comparison was? e.g. Processor, GPU, which molecule (basis, electron count), batch size? |
I am not understanding this... What do you want me to do? |
See my comment on the source code https://github.com/QMCPACK/qmcpack/pull/5342/files#r1972521897 |
Preparing file with summary. |
…compared to risks of memory corruption
I profiled runs with and without this change. This optimization pattern works cross platform (nvidia/intel checked)
proposed
the difference is that the old code needs a runtime table lookup to find out the dev_ptr and it requires mutex locking the table during lookup and it is a bottleneck. When threads doesn't do the look up frequently, the cost is negligible. The reason of slow LCAO was its small kernels, high call counts nature. The proposed way does have a drawback. Device pointers are exposed in the host code. This may cased segfault if they are not managed correctly, for example de-referenced on the host by accident. For the moment, I think we can take this PR that uses device ptr. Eventually once we change the code computing basis species by species. We can restore the code using the old code pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please only keep changes of switching to dev_ptr
and run clang-format.
@@ -213,6 +213,7 @@ class SoaLocalizedBasisSet : public SoaBasisSetBase<ORBT> | |||
struct SoaLocalizedBSetMultiWalkerMem; | |||
/// multi walker resource handle | |||
ResourceHandle<SoaLocalizedBSetMultiWalkerMem> mw_mem_handle_; | |||
NewTimer & NumCenter_timer_ ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel the need of adding this timer. Those inside LOBasisSet[IonID[c]]->mw_evaluateVGL
is probably enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave as I need it for the next one and let's not be pedantic on one timer I really need for the next PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is needed for the next one, it is better to be included in the next one. We keep PRs focusing on self contained topics.
Fantastic!!! Thanks Ye. However, I am almost half through isolating per basiset. Will start in a new PR. |
Please make small PRs. large ones are too painful to review. |
auto* dr_ptr = dr.data(); | ||
auto* r_ptr = r.data(); | ||
auto* correctphase_ptr = correctphase.device_data(); | ||
auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this line back to its original spot.
|
||
PRAGMA_OFFLOAD("omp target teams distribute parallel for map(to:SuperTwist_ptr[:SuperTwist.size()], \ | ||
Tv_list_ptr[3*nElec*center_idx:3*nElec], correctphase_ptr[:nElec]) ") | ||
PRAGMA_OFFLOAD(" omp target teams distribute parallel for \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove space " omp
auto* dr_ptr = dr.data(); | ||
auto* r_ptr = r.data(); | ||
//Assuming These are correctly computed on Device | ||
auto* periodic_image_displacements_ptr = periodic_image_displacements_.device_data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this line back to its original spot?
auto* LM_ptr = LM.data(); | ||
auto* NL_ptr = NL.data(); | ||
auto* psi_ptr = psi.data(); | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
Tv_list[idim + 3 * (iw + c * Nw)] = (ions_.R[c][idim] - coordR[idim]) - displ[c][idim]; | ||
displ_list_tr[idim + 3 * (iw + c * Nw)] = displ[c][idim]; | ||
} | ||
for (size_t iw = 0; iw < P_list.size(); iw++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange {
before this line.
Proposed changes
First set of optimizations allowing for a 2.2X speed up by reducing device pointer lookup via OpenMP.
The LCAO branch was about 3 to 4X slower than CPU. After investigation, It seemed like it was due to map(to:) not checking if data was already on device. So converted all calls to is_device_ptr which fixed the issue.
Tested on multiple systems, but more specifically "TOU ASN" molecule: O(3) S(1) C(8) N(5) H(17), with 106 electrons and 1185 basis functions.
What type(s) of changes does this code introduce?
Code Optimization
Does this introduce a breaking change?
What systems has this change been tested on?
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.