Optimize ompGemm_m multiply kernel #327
Conversation
TODO: Remove before merging
Previous implementations left in comments for testing
Attempt to force vectorization of memory copies in `m_kern_min`. Needs cleanup.
Performance profiling data was collected; the main improvement to note is the reduction in time spent in the kernel's memory copies.

[profiler screenshots: baseline (develop branch) vs. this PR]
On the topic of the vectorized memory copies: you may note that Advisor sees three loops on lines 411 and 412. I think one is the vectorized loop and one is the remainder (the iterations not divisible by 4). It looks like the third is a scalar version; perhaps the compiler generates both vector and scalar instructions for the loop, though it's somewhat unclear to me why Advisor would see both of them. The important bit to see here is the […]

[Advisor screenshot]
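As context for the three loop versions Advisor reports, here is a minimal sketch of this kind of copy loop; the subroutine and variable names are hypothetical, not the actual `m_kern_min` code. A compiler vectorizing such a loop typically emits a vector body plus a scalar remainder loop for the iterations left over when `nd1` is not a multiple of the vector width, so one source loop can show up as several.

```fortran
! Hypothetical sketch of a small block copy like those in m_kern_min.
! With vectorization requested (e.g. via !$omp simd), the compiler emits
! a vector body plus a scalar remainder for the leftover iterations, so
! a single source loop can appear as multiple loops in Advisor.
subroutine copy_block(nd1, isrc, idst, src, dst)
  implicit none
  integer, intent(in)             :: nd1, isrc, idst
  double precision, intent(in)    :: src(*)
  double precision, intent(inout) :: dst(*)
  integer :: i

  !$omp simd
  do i = 1, nd1
     dst(idst + i - 1) = src(isrc + i - 1)
  end do
end subroutine copy_block
```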
We need data on one thread for comparison, and runs across nodes as well as on single nodes. We also need to find out what/where `kmp_fork_barrier` is and whether we can reduce it.
Here is data on one thread. These screenshots are a bit busy but the main points here are: […]

[screenshots: develop branch vs. this PR]
Regarding the meaning of `kmp_fork_barrier`: I ran the OpenMP threading analysis. In the screenshot below I've zoomed the graph at the bottom in to a typical 2 s interval in the middle of the run, where primarily the multiply kernel is running. My main observation here is that the main source of wasted time is `kmp_fork_barrier`, i.e. threads waiting at the fork/join of parallel regions.

edit: In the graph below, the little blue bridge shapes represent parallel regions. VTune tells me that the OpenMP Region Duration Type is either […]

[VTune screenshot: OpenMP threading analysis]
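For reference, `kmp_fork_barrier` is the barrier in Intel's OpenMP runtime where worker threads wait while parallel regions are forked and joined, so many short-lived regions inflate it. Below is a generic, hedged sketch of the usual mitigation (merging regions); it is not code from this PR, and the array names are invented.

```fortran
program fork_barrier_sketch
  implicit none
  integer, parameter :: n = 512, nblocks = 100
  double precision :: a(n, nblocks), b(n, nblocks), c(n, nblocks)
  integer :: i, ib

  a = 1.0d0
  b = 2.0d0

  ! Variant 1: a new parallel region per block. Threads pass through the
  ! runtime's fork/join barrier (kmp_fork_barrier) nblocks times.
  do ib = 1, nblocks
     !$omp parallel do private(i)
     do i = 1, n
        c(i, ib) = a(i, ib) * b(i, ib)
     end do
     !$omp end parallel do
  end do

  ! Variant 2: one enclosing region; the thread team is forked once and
  ! the worksharing loops reuse it, shrinking time in the fork barrier.
  !$omp parallel private(ib)
  do ib = 1, nblocks
     !$omp do private(i)
     do i = 1, n
        c(i, ib) = a(i, ib) * b(i, ib)
     end do
     !$omp end do
  end do
  !$omp end parallel

  print *, c(1, 1)
end program fork_barrier_sketch
```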
I think that this is all fine, though the performance gain I'm seeing over `ompGemm_m` in v1.3 is often quite small. A few little queries (redundant targets, for instance).

Just for my information, have you got numbers comparing `ompGemm_m` in v1.3 and in this version? It's not clear from the above which kernel was used for the baseline.
Here are total run times with the current branches; I compiled each version with both compilers.

[table: total run times per branch and compiler]

The develop and v1.3 branches are nearly identical in runtime, as you would expect. You might notice that the […]

The `tk-optimize-multiply-kernel` branch is about 20% faster than develop and v1.3 with the Intel compiler, which is in line with my previous profiling.

I'm trying to find out if this has been improved in newer versions of the […]

edit: on Archer2 with […]
After testing different implementations for the temporary `c` copy, I've come to the conclusion that the loop-based implementation performs the best across both compilers. Below is a comparison on Myriad.

[table: timing comparison of temporary `c` copy implementations on Myriad]
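As an illustration of the loop-based version, here is a hedged sketch; the subroutine name, the `ncbeg` offset, and the exact indexing are assumptions for this example rather than the actual CONQUEST code.

```fortran
! Hypothetical sketch of a loop-based copy of the dense dgemm result back
! into the sparse 1D storage of C: plain nested loops over the typically
! small nd1 x nd3 block. Names and indexing are invented for illustration.
subroutine store_c_block(nd1, nd3, ncbeg, tempc, c)
  implicit none
  integer, intent(in)             :: nd1, nd3, ncbeg
  double precision, intent(in)    :: tempc(nd1, nd3) ! dense result block
  double precision, intent(inout) :: c(:)            ! sparse 1D storage
  integer :: i, j

  do j = 1, nd3
     do i = 1, nd1
        c(ncbeg + (j - 1) * nd1 + i - 1) = tempc(i, j)
     end do
  end do
end subroutine store_c_block
```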
This seems like a very sensible way forward. My only remaining question is whether we really need the explicit inner loop, rather than an array range.
I did a very quick test, and replacing the inner loop with a range was slower with the […]
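For comparison, the "range" (array-section) form of the same inner copy would replace the loop nest in the sketch above with the following; per the quick test it was the slower variant (same hypothetical names as before).

```fortran
! Array-section form replacing the inner loop of store_c_block above.
! Semantically identical to the explicit loop, but measured slower here.
do j = 1, nd3
   c(ncbeg + (j - 1) * nd1 : ncbeg + j * nd1 - 1) = tempc(1:nd1, j)
end do
```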
This PR optimizes the memory copies in `m_kern_min` and `m_kern_max` of `multiply_kernel_ompGemm_m`. The intention is that this multiply kernel should be favoured in most (if not all) cases, and the worse-performing multiply kernels could be dropped.

In `m_kern_max` we avoid making temporary copies of matrices `A` and `B` completely, by calling `dgemm` on the whole matrices including zero elements. This is the main performance gain. The zero elements are skipped when copying the temporary result back to `C`, keeping the end result sparse. `dgemm` also accepts 1D arrays as arguments, so the only remaining work before the `dgemm` call is computing the start and end indices into `A` and `B`. After the `dgemm` call the result is copied into `C` from a temporary array.

In `m_kern_min` we still make temporary copies of `B` and `C`, because both of them are stored in a sparse data structure. I've tried vectorizing the copies but the benefits are not great due to the typically small values of `nd1` and `nd3`. No temporary copy is needed for the result `A`.

The common index calculations of `m_kern_max` and `m_kern_min` have been refactored into a separate subroutine `precompute_indices` to reduce code duplication. A number of unused variables have also been removed, and comments added.

Also adds `system.myriad.make`, since I've found Myriad useful for testing and development.
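To make the `m_kern_max` strategy concrete, here is a hedged sketch; the subroutine, the `abeg`/`bbeg` offsets, and the dimension names are invented for this example, with only the shape of the `dgemm` call taken from the description above.

```fortran
! Hedged sketch (hypothetical names and indexing, not the CONQUEST code):
! dgemm is called directly on the 1D sparse storage of A and B, zeros and
! all, so no temporary copies of A or B are made. Only the dense result
! block is temporary.
subroutine gemm_block(na, nb, nc, abeg, bbeg, a, b, tempc)
  implicit none
  external :: dgemm
  integer, intent(in)           :: na, nb, nc, abeg, bbeg
  double precision, intent(in)  :: a(*), b(*)    ! sparse 1D storage
  double precision, intent(out) :: tempc(na, nc) ! dense na x nc result

  ! A is na x nb starting at a(abeg); B is nb x nc starting at b(bbeg).
  call dgemm('N', 'N', na, nc, nb, 1.0d0, a(abeg), na, &
             b(bbeg), nb, 0.0d0, tempc, na)
end subroutine gemm_block
```

Since `dgemm` sees the full blocks, zeros included, keeping `C` sparse relies on the copy-back step skipping the zero elements, as described above.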