
Optimize ompGemm_m multiply kernel #327

Merged · 45 commits into develop · May 22, 2024

Conversation

@tkoskela commented Feb 13, 2024

This PR optimizes the memory copies in m_kern_min and m_kern_max of multiply_kernel_ompGemm_m. The intention is that this multiply kernel should be favoured in most (if not all) cases, and the worse-performing multiply kernels could be dropped.

In m_kern_max we avoid making temporary copies of matrices A and B entirely, by calling dgemm on the whole matrices including zero elements. This is the main performance gain. The zero elements are skipped when copying the temporary result back to C, keeping the end result sparse. dgemm also accepts 1D arrays as arguments, so the only remaining work before the dgemm call is computing the start and end indices into A and B. After the dgemm call the result is copied into C from a temporary array.
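
To make the indexing idea concrete, here is a minimal, self-contained sketch of the approach, assuming illustrative names (nabeg, nbbeg, tempc) and block sizes; it is not the actual CONQUEST code:

```fortran
! Sketch only: dgemm is called directly on 1-D storage (zeros included),
! so no dense temporaries of A or B are needed; link with a BLAS library.
program gemm_on_flat_storage
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd1 = 4, nd2 = 3, nd3 = 5   ! illustrative block sizes
  real(dp) :: a(nd1*nd3), b(nd3*nd2), tempc(nd1*nd2)
  integer  :: nabeg, nbbeg
  external :: dgemm

  call random_number(a)
  call random_number(b)
  nabeg = 1                 ! start offset of this block in the flat A array
  nbbeg = 1                 ! start offset of this block in the flat B array

  ! dgemm accepts 1-D actual arguments: only the start indices and leading
  ! dimensions matter, so the whole block (zero elements and all) is
  ! multiplied, and sparsity is restored in the copy-back to C.
  call dgemm('N', 'N', nd1, nd2, nd3, 1.0_dp, a(nabeg), nd1, &
             b(nbbeg), nd3, 0.0_dp, tempc, nd1)

  print *, tempc
end program gemm_on_flat_storage
```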

In m_kern_min we still make temporary copies of B and C because both of them are stored in a sparse data structure. I've tried vectorizing the copies but the benefits are not great due to the typically small values of nd1 and nd3. No temporary copy is needed for the result A.
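
For illustration, a minimal sketch of the kind of gather copy involved (the names tempb and nbbeg and the sizes are hypothetical): each column is a short contiguous run, so with nd3 typically this small the vectorizer has little to work with.

```fortran
program gather_copy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd3 = 3, nd2 = 4   ! nd3 is typically small
  real(dp) :: b(nd3*nd2)                   ! one block of the sparse 1-D storage
  real(dp) :: tempb(nd3, nd2)              ! dense scratch passed to dgemm
  integer  :: j, nbbeg

  call random_number(b)
  nbbeg = 1                                ! block start offset in flat B

  ! Gather loop: trip counts of 1-2 per column (as seen in Advisor below)
  ! leave little gain for vectorization, even with !$omp simd applied.
  do j = 1, nd2
    tempb(:, j) = b(nbbeg + (j-1)*nd3 : nbbeg + j*nd3 - 1)
  end do

  print *, tempb
end program gather_copy_sketch
```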

The common index calculations of m_kern_max and m_kern_min have been refactored into a separate subroutine precompute_indices to reduce code duplication.
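
A hypothetical shape for such a helper (the real precompute_indices in this PR may take different arguments); the point is that both kernels compute the same block offsets into the flat arrays:

```fortran
! Hypothetical sketch, not the PR's actual interface: block k of an
! nd1 x nd3 matrix stored column-major in a flat array spans these offsets.
subroutine precompute_indices(k, nd1, nd3, nabeg, naend)
  implicit none
  integer, intent(in)  :: k, nd1, nd3    ! block number and block dimensions
  integer, intent(out) :: nabeg, naend   ! start/end offsets in flat storage
  nabeg = (k - 1)*nd1*nd3 + 1
  naend = nabeg + nd1*nd3 - 1
end subroutine precompute_indices
```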

A number of unused variables have also been removed and comments added.

Also adds system.myriad.make, since I've found myriad useful for testing and development.

Previous implementations left in comments for testing
Attempt to force vectorization of memory copies in m_kern_min. Needs cleanup.
@tkoskela commented Mar 21, 2024

Performance profiling data was collected using benchmarks/matrix_multiply/si_442.xtl. I ran it on myriad with 8 MPI ranks and 4 OpenMP threads using Intel(R) VTune(TM) Profiler 2022.2.0.

The main improvement to note is the reduction in time spent in m_kern_min (103.7 s -> 76.9 s) and m_kern_max (152.1 s -> 85.2 s), shown in the "Top Hotspots" table. As an interesting byproduct, the time in kmp_fork_barrier, i.e. the OpenMP load imbalance, has decreased significantly as well (560 s -> 279.6 s). This would make sense if there are not enough m_kern_min and m_kern_max calls to keep all threads busy: the serial calls are now faster, so the idle threads wait for less time. I doubt anything has fundamentally improved with the load balance.

The time in dgemm has also decreased (174 s -> 95 s). This is a bit surprising, since I would have expected us to be multiplying bigger matrices than previously.

Baseline: Develop branch

[screenshot: VTune Top Hotspots, develop branch]

This PR

[screenshot: VTune Top Hotspots, this PR]

@tkoskela tkoskela marked this pull request as ready for review March 21, 2024 15:56
@tkoskela tkoskela changed the title Optimize multiply kernel Optimize ompGemm_m multiply kernel Mar 21, 2024
@tkoskela tkoskela requested review from ilectra and davidbowler March 21, 2024 15:57
@tkoskela commented Mar 21, 2024

On the topic of vectorized memory copies of B and C in m_kern_min, this is what the data in Advisor looks like:

[screenshot: Advisor vectorization data for the m_kern_min copy loops]

You may note that Advisor sees three loops on lines 411 and 412. I think one is the vectorized loop and one is the remainder (the iterations not divisible by 4). The third looks like a scalar version; perhaps the compiler generates both vector and scalar instructions for the loop, though it's somewhat unclear to me why Advisor would see both. The important bit here is the Gain column, which shows the gain from vectorization is less than 1. In the Trip Counts column we see the average trip counts in the vectorized loops are 1 or 2, which explains the low gain. This data was collected using the LargerND input data set I got from @davidbowler.

@davidbowler left a comment

We need data on one thread for comparison, and runs between nodes as well as on single nodes. We also need to find out what/where kmp_fork_barrier is and whether we can reduce it.

@tkoskela

Here is data on one thread. These screenshots are a bit busy, but the main points are:

- The total elapsed time is reduced from 505 s to 401 s.
- The time spent in dgemm is unchanged.
- The time spent in m_kern_max is reduced from 105 s to 42 s.
- The time spent in m_kern_min is reduced from 75 s to 69 s.

Develop branch

[screenshot: single-thread VTune profile, develop branch]

This PR

[screenshot: single-thread VTune profile, this PR]

@tkoskela commented Apr 15, 2024

Regarding the meaning of kmp_fork_barrier, I followed the advice in this stackoverflow thread:

> You need to learn to use VTune better. It has specific OpenMP analyses which avoid you having to ask about the internals of the OpenMP runtime.

I ran the OpenMP threading analysis. In the screenshot below I've zoomed in on the graph at the bottom to a typical 2 s interval in the middle of the run, where primarily multiply_module, calc_matrix_elements_module and pao_grid_transform_module are being called. I probably won't be able to explain everything here, but let's discuss when we next meet.

My main observation here is that the main source of wasted time is Serial - outside parallel regions, rather than thread imbalance in the parallel regions as I previously thought. It's not clear to me why, in the graph at the bottom, the worker threads are sometimes red and sometimes green outside of parallel regions. Maybe this is an internal optimization in the runtime, with some logic about whether to keep the threads spinning or make them wait.

edit: In the graph below, the little blue bridge shapes represent parallel regions. VTune reports the OpenMP Region Duration Type as either Good or Fast in nearly all of them, which also reassures me that thread imbalance is not our main issue.

[screenshot: VTune OpenMP threading analysis timeline]

@tkoskela commented Apr 15, 2024

Expanding the Serial - outside parallel region line, you get a breakdown of where it is coming from. There are some FFTs here, if you look closely 🔎

[screenshot: breakdown of Serial - outside parallel region time]

@davidbowler left a comment

I think that this is all fine, though the performance gain I'm seeing over ompGemm_m in v1.3 is often quite small. A few little queries (redundant targets, for instance).

@davidbowler

Just for my information, have you got numbers comparing ompGemm_m in v1.3 and in this version? It's not clear from above which kernel was used for the baseline.

@tkoskela commented May 15, 2024

Here are total run times with current develop, the v1.3 release and the tk-optimize-multiply-kernel branch. These were run on myriad using 8 MPI ranks and 1 OpenMP thread. I repeated each run three times, but as you can see the variation in runtime is fairly small. All of these runs use the ompGemm_m kernel.

I compiled each version with both the intel/2022.2 and gnu/9.2.0 compilers available on myriad.

The develop and v1.3 branches are nearly identical in runtime, as you would expect. You might notice that the opt branch is very slow with the gnu compiler. I've tracked this down to the pack intrinsic that is called in

```fortran
c(ncbeg:ncend) = c(ncbeg:ncend) + pack(tempc(1:nd1, tcbeg:tcend), .true.)
```

The tk-optimize-multiply-kernel branch is about 20% faster than develop and v1.3 with the intel compiler, which is in line with my previous profiling.

I'm trying to find out if this has been improved in newer versions of the gnu compilers.

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| intel/2022.2 | v1.3 | ompGemm_m | 140.310 s | 140.143 s | 140.561 s |
| intel/2022.2 | develop | ompGemm_m | 141.606 s | 140.742 s | 140.672 s |
| intel/2022.2 | tk-optimize-multiply-kernel | ompGemm_m | 113.364 s | 113.466 s | 112.553 s |
| gnu/9.2.0 | v1.3 | ompGemm_m | 136.200 s | 135.697 s | 136.856 s |
| gnu/9.2.0 | develop | ompGemm_m | 136.234 s | 138.409 s | 136.172 s |
| gnu/9.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 194.480 s | 193.469 s | 193.969 s |

edit: on Archer2 with gfortran 11.2.0 I am still seeing slower performance

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| gnu/11.2.0 | develop | ompGemm_m | 180.277 s | | |
| gnu/11.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 231.697 s | | |

@tkoskela commented May 17, 2024

After testing different implementations for the temporary c copy, I've concluded that the loop-based implementation performs best across both intel and gnu compilers, so I've reverted back to that.
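
For reference, a self-contained sketch of the loop-based form (the sizes and surrounding declarations are illustrative; the index expressions mirror the pack() line quoted earlier):

```fortran
program loop_copy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd1 = 4, tcbeg = 2, tcend = 5   ! illustrative sizes
  real(dp) :: tempc(nd1, 8), c(64)
  integer  :: i, j, ncbeg

  call random_number(tempc)
  c = 0.0_dp
  ncbeg = 1

  ! Equivalent to
  !   c(ncbeg:ncend) = c(ncbeg:ncend) + pack(tempc(1:nd1, tcbeg:tcend), .true.)
  ! but the explicit loops compile to fast code under both compilers,
  ! whereas gfortran turns pack() into a slow library call here.
  do j = tcbeg, tcend
    do i = 1, nd1
      c(ncbeg + (j - tcbeg)*nd1 + i - 1) = &
          c(ncbeg + (j - tcbeg)*nd1 + i - 1) + tempc(i, j)
    end do
  end do

  print *, c(1:nd1*(tcend - tcbeg + 1))
end program loop_copy_sketch
```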

Below is a comparison on myriad

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| intel/2022.2 | tk-optimize-multiply-kernel | ompGemm_m | 110.059 s | 111.754 s | 110.925 s |
| gnu/9.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 122.344 s | 122.161 s | 122.022 s |

@davidbowler

This seems like a very sensible way forward. My only remaining question is whether we really need the explicit loop over nd1 or whether that should be a range.

@tkoskela

I did a very quick test and replacing the inner loop with a range was slower with the gnu compiler on my laptop.
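
For comparison, the range-based variant in question would look roughly like this, as a drop-in replacement for the inner i-loop in the sketch above:

```fortran
! Range form: one array-section assignment per column instead of the inner
! i-loop. Semantically identical, but measured slower under gfortran here.
do j = tcbeg, tcend
  c(ncbeg + (j - tcbeg)*nd1 : ncbeg + (j - tcbeg + 1)*nd1 - 1) = &
      c(ncbeg + (j - tcbeg)*nd1 : ncbeg + (j - tcbeg + 1)*nd1 - 1) + tempc(1:nd1, j)
end do
```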

@tkoskela tkoskela merged commit 35302e8 into develop May 22, 2024
8 checks passed
@tkoskela tkoskela deleted the tk-optimize-multiply-kernel branch May 22, 2024 12:33