
Optimize ompGemm_m multiply kernel #327

Merged · 45 commits into develop · May 22, 2024

Conversation

@tkoskela commented Feb 13, 2024

This PR optimizes the memory copies in m_kern_min and m_kern_max of multiply_kernel_ompGemm_m. The intention is that this multiply kernel should be favoured in most (if not all) cases, and the worse-performing multiply kernels could be dropped.

In m_kern_max we avoid making temporary copies of matrices A and B entirely, by calling dgemm on the whole matrices including zero elements. This is the main performance gain. The zero elements are skipped when copying the temporary result back to C, keeping the end result sparse. dgemm also accepts 1D arrays as arguments, so the only remaining work before the dgemm call is computing the start and end indices into A and B. After the dgemm call the result is copied into C from a temporary array.
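
To make the indexing idea concrete, here is a minimal, self-contained sketch of the approach, assuming illustrative names (nabeg, nbbeg, tempc) and block sizes; it is not the actual CONQUEST code:

```fortran
! Sketch only: dgemm is called directly on 1-D storage (zeros included),
! so no dense temporaries of A or B are needed; link with a BLAS library.
program gemm_on_flat_storage
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd1 = 4, nd2 = 3, nd3 = 5   ! illustrative block sizes
  real(dp) :: a(nd1*nd3), b(nd3*nd2), tempc(nd1*nd2)
  integer  :: nabeg, nbbeg
  external :: dgemm

  call random_number(a)
  call random_number(b)
  nabeg = 1                 ! start offset of this block in the flat A array
  nbbeg = 1                 ! start offset of this block in the flat B array

  ! dgemm accepts 1-D actual arguments: only the start indices and leading
  ! dimensions matter, so the whole block (zero elements and all) is
  ! multiplied, and sparsity is restored in the copy-back to C.
  call dgemm('N', 'N', nd1, nd2, nd3, 1.0_dp, a(nabeg), nd1, &
             b(nbbeg), nd3, 0.0_dp, tempc, nd1)

  print *, tempc
end program gemm_on_flat_storage
```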

In m_kern_min we still make temporary copies of B and C because both of them are stored in a sparse data structure. I've tried vectorizing the copies but the benefits are not great due to the typically small values of nd1 and nd3. No temporary copy is needed for the result A.
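
For illustration, a minimal sketch of the kind of gather copy involved (the names tempb and nbbeg and the sizes are hypothetical): each column is a short contiguous run, so with nd3 typically this small the vectorizer has little to work with.

```fortran
program gather_copy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd3 = 3, nd2 = 4   ! nd3 is typically small
  real(dp) :: b(nd3*nd2)                   ! one block of the sparse 1-D storage
  real(dp) :: tempb(nd3, nd2)              ! dense scratch passed to dgemm
  integer  :: j, nbbeg

  call random_number(b)
  nbbeg = 1                                ! block start offset in flat B

  ! Gather loop: trip counts of 1-2 per column (as seen in Advisor below)
  ! leave little gain for vectorization, even with !$omp simd applied.
  do j = 1, nd2
    tempb(:, j) = b(nbbeg + (j-1)*nd3 : nbbeg + j*nd3 - 1)
  end do

  print *, tempb
end program gather_copy_sketch
```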

The common index calculations of m_kern_max and m_kern_min have been refactored into a separate subroutine precompute_indices to reduce code duplication.
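
A hypothetical shape for such a helper (the real precompute_indices in this PR may take different arguments); the point is that both kernels compute the same block offsets into the flat arrays:

```fortran
! Hypothetical sketch, not the PR's actual interface: block k of an
! nd1 x nd3 matrix stored column-major in a flat array spans these offsets.
subroutine precompute_indices(k, nd1, nd3, nabeg, naend)
  implicit none
  integer, intent(in)  :: k, nd1, nd3    ! block number and block dimensions
  integer, intent(out) :: nabeg, naend   ! start/end offsets in flat storage
  nabeg = (k - 1)*nd1*nd3 + 1
  naend = nabeg + nd1*nd3 - 1
end subroutine precompute_indices
```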

A number of unused variables have also been removed and comments added.

Also adds system.myriad.make, since I've found myriad useful for testing and development.

Previous implementations left in comments for testing
Attempt to force vectorization of memory copies in m_kern_min. Needs cleanup.
@tkoskela commented Mar 21, 2024

Performance profiling data was collected using benchmarks/matrix_multiply/si_442.xtl. I ran it on myriad with 8 MPI ranks and 4 OpenMP threads using Intel(R) VTune(TM) Profiler 2022.2.0.

The main improvement to note is the reduction in time spent in m_kern_min (103.7 s -> 76.9 s) and m_kern_max (152.1 s -> 85.2 s), shown in the "Top Hotspots" table. As an interesting byproduct, the time in kmp_fork_barrier, i.e. the OpenMP load imbalance, has decreased significantly as well (560 s -> 279.6 s). This would make sense if there are not enough m_kern_min and m_kern_max calls to keep all threads busy: the serial calls are now faster, so the idle threads wait for less time. I doubt anything has fundamentally improved with the load balance.

The time in dgemm has also decreased (174 s -> 95 s). This is a bit surprising, since I would have expected us to be multiplying bigger matrices than previously.

Baseline: Develop branch

[screenshot: VTune Top Hotspots, develop branch]

This PR

[screenshot: VTune Top Hotspots, this PR]

@tkoskela tkoskela marked this pull request as ready for review March 21, 2024 15:56
@tkoskela tkoskela changed the title Optimize multiply kernel Optimize ompGemm_m multiply kernel Mar 21, 2024
@tkoskela tkoskela requested review from ilectra and davidbowler March 21, 2024 15:57
@tkoskela commented Mar 21, 2024

On the topic of vectorized memory copies of B and C in m_kern_min, this is what the data in Advisor looks like:

[screenshot: Advisor vectorization data for the m_kern_min copy loops]

You may note that Advisor sees three loops on lines 411 and 412. I think one is the vectorized loop and one is the remainder (the iterations not divisible by 4). The third looks like a scalar version; perhaps the compiler generates both vector and scalar instructions for the loop, though it's somewhat unclear to me why Advisor would see both. The important bit here is the Gain column, which shows the gain from vectorization is less than 1. In the Trip Counts column we see the average trip counts in the vectorized loops are 1 or 2, which explains the low gain. This data was collected using the LargerND input data set I got from @davidbowler.

@davidbowler left a comment

We need data on one thread for comparison, and runs between nodes as well as on single nodes. We also need to find out what/where kmp_fork_barrier is and whether we can reduce it.

@tkoskela

Here is data on one thread. These screenshots are a bit busy, but the main points are:

- The total elapsed time is reduced from 505 s to 401 s.
- The time spent in dgemm is unchanged.
- The time spent in m_kern_max is reduced from 105 s to 42 s.
- The time spent in m_kern_min is reduced from 75 s to 69 s.

Develop branch

[screenshot: single-thread VTune profile, develop branch]

This PR

[screenshot: single-thread VTune profile, this PR]

@tkoskela commented Apr 15, 2024

Regarding the meaning of kmp_fork_barrier, I followed the advice in this stackoverflow thread:

> You need to learn to use VTune better. It has specific OpenMP analyses which avoid you having to ask about the internals of the OpenMP runtime.

I ran the OpenMP threading analysis. In the screenshot below I've zoomed in on the graph at the bottom to a typical 2 s interval in the middle of the run, where primarily multiply_module, calc_matrix_elements_module and pao_grid_transform_module are being called. I probably won't be able to explain everything here, but let's discuss when we next meet.

My main observation here is that the main source of wasted time is Serial - outside parallel regions, rather than thread imbalance in the parallel regions as I previously thought. It's not clear to me why, in the graph at the bottom, the worker threads are sometimes red and sometimes green outside of parallel regions. Maybe this is an internal optimization in the runtime, with some logic about whether to keep the threads spinning or make them wait.

edit: In the graph below, the little blue bridge shapes represent parallel regions. VTune reports the OpenMP Region Duration Type as either Good or Fast in nearly all of them, which also reassures me that thread imbalance is not our main issue.

[screenshot: VTune OpenMP threading analysis timeline]

@tkoskela commented Apr 15, 2024

Expanding the Serial - outside parallel region line, you get a breakdown of where it is coming from. There are some FFTs here, if you look closely 🔎

[screenshot: breakdown of Serial - outside parallel region time]

@davidbowler left a comment

I think that this is all fine, though the performance gain I'm seeing over ompGemm_m in v1.3 is often quite small. A few little queries (redundant targets, for instance).

@davidbowler

Just for my information, have you got numbers comparing ompGemm_m in v1.3 and in this version? It's not clear from above which kernel was used for the baseline.

@tkoskela commented May 15, 2024

Here are total run times with current develop, the v1.3 release and the tk-optimize-multiply-kernel branch. These were run on myriad using 8 MPI ranks and 1 OpenMP thread. I repeated each run three times, but as you can see the variation in runtime is fairly small. All of these runs use the ompGemm_m kernel.

I compiled each version with both the intel/2022.2 and gnu/9.2.0 compilers available on myriad.

The develop and v1.3 branches are nearly identical in runtime, as you would expect. You might notice that the opt branch is very slow with the gnu compiler. I've tracked this down to the pack intrinsic that is called in

```fortran
c(ncbeg:ncend) = c(ncbeg:ncend) + pack(tempc(1:nd1, tcbeg:tcend), .true.)
```

The tk-optimize-multiply-kernel branch is about 20% faster than develop and v1.3 with the intel compiler, which is in line with my previous profiling.

I'm trying to find out if this has been improved in newer versions of the gnu compilers.

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| intel/2022.2 | v1.3 | ompGemm_m | 140.310 s | 140.143 s | 140.561 s |
| intel/2022.2 | develop | ompGemm_m | 141.606 s | 140.742 s | 140.672 s |
| intel/2022.2 | tk-optimize-multiply-kernel | ompGemm_m | 113.364 s | 113.466 s | 112.553 s |
| gnu/9.2.0 | v1.3 | ompGemm_m | 136.200 s | 135.697 s | 136.856 s |
| gnu/9.2.0 | develop | ompGemm_m | 136.234 s | 138.409 s | 136.172 s |
| gnu/9.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 194.480 s | 193.469 s | 193.969 s |

edit: on Archer2 with gfortran 11.2.0 I am still seeing slower performance

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| gnu/11.2.0 | develop | ompGemm_m | 180.277 s | | |
| gnu/11.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 231.697 s | | |

@tkoskela commented May 17, 2024

After testing different implementations for the temporary c copy, I've concluded that the loop-based implementation performs best across both intel and gnu compilers, so I've reverted back to that.
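
For reference, a self-contained sketch of the loop-based form (the sizes and surrounding declarations are illustrative; the index expressions mirror the pack() line quoted earlier):

```fortran
program loop_copy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nd1 = 4, tcbeg = 2, tcend = 5   ! illustrative sizes
  real(dp) :: tempc(nd1, 8), c(64)
  integer  :: i, j, ncbeg

  call random_number(tempc)
  c = 0.0_dp
  ncbeg = 1

  ! Equivalent to
  !   c(ncbeg:ncend) = c(ncbeg:ncend) + pack(tempc(1:nd1, tcbeg:tcend), .true.)
  ! but the explicit loops compile to fast code under both compilers,
  ! whereas gfortran turns pack() into a slow library call here.
  do j = tcbeg, tcend
    do i = 1, nd1
      c(ncbeg + (j - tcbeg)*nd1 + i - 1) = &
          c(ncbeg + (j - tcbeg)*nd1 + i - 1) + tempc(i, j)
    end do
  end do

  print *, c(1:nd1*(tcend - tcbeg + 1))
end program loop_copy_sketch
```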

Below is a comparison on myriad

| compiler | CONQUEST branch | multiply kernel | run 1 | run 2 | run 3 |
|---|---|---|---|---|---|
| intel/2022.2 | tk-optimize-multiply-kernel | ompGemm_m | 110.059 s | 111.754 s | 110.925 s |
| gnu/9.2.0 | tk-optimize-multiply-kernel | ompGemm_m | 122.344 s | 122.161 s | 122.022 s |

@davidbowler

This seems like a very sensible way forward. My only remaining question is whether we really need the explicit loop over nd1 or whether that should be a range.

@tkoskela

I did a very quick test and replacing the inner loop with a range was slower with the gnu compiler on my laptop.
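
For comparison, the range-based variant in question would look roughly like this, as a drop-in replacement for the inner i-loop in the sketch above:

```fortran
! Range form: one array-section assignment per column instead of the inner
! i-loop. Semantically identical, but measured slower under gfortran here.
do j = tcbeg, tcend
  c(ncbeg + (j - tcbeg)*nd1 : ncbeg + (j - tcbeg + 1)*nd1 - 1) = &
      c(ncbeg + (j - tcbeg)*nd1 : ncbeg + (j - tcbeg + 1)*nd1 - 1) + tempc(1:nd1, j)
end do
```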

@tkoskela tkoskela merged commit 35302e8 into develop May 22, 2024
8 checks passed
@tkoskela tkoskela deleted the tk-optimize-multiply-kernel branch May 22, 2024 12:33