HELP (Helsinki Lattice Package) is a Quantum Chromodynamics (QCD) code developed at the Department of Physics, University of Helsinki. It has been developed over several years and contains a number of different models. It is not licensed under any open-source license as such, but is available upon request for academic use.
The pure-MPI version of the code was straightforward to compile by simply adapting the makefiles to use the Intel compilers. In the files wilson/Make_linux_mpi and libraries/Make_vanilla, the compiler and optimization lines were changed to

```
CC = mpiicc
OPT = -O3 -xMIC-AVX512 -ip -fp-model fast=2
```
There is an experimental (and apparently unfinished) OpenMP version of the code, but it does not work: it segfaults, at least with the test cases we had. This happens also with other compilers (GNU) and on other platforms (Cray XC40 with Haswell CPUs).
A small test case, quick to run on a single node, was provided by the lead developer (its physics content was not specified to us).
The case is compute-bound and performs essentially no I/O (just a kilobyte or so of log output).
A brute-force sweep over the relevant run-time parameters (number of MPI tasks and hyperthreading width) was carried out; the run script is in this repository.
The best results from the sweep are as follows.
| Run | Runtime (s) |
|---|---|
| 1-way HT (64 MPI tasks) | 28.6 |
| 2-way HT (128 MPI tasks) | 33.5 |
| 4-way HT (256 MPI tasks) | 51.5 |
| HSW reference (16 MPI tasks) | 17.2 |
| HSW reference (32 MPI tasks) | 10.6 |
The reference result was obtained on one node of a Cray XC40 featuring two 12-core 2.6 GHz Haswell processors, run with one-way hyperthreading. The decomposition does not allow 24 MPI tasks, hence results with 16 tasks (on one node) and 32 tasks (on two nodes) are given. Extrapolating these to one full 24-core node yields 13.8 s.
One can thus argue that the performance of a single KNL is roughly 50-60% of that of a dual-socket Haswell node.
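The arithmetic behind these two statements, assuming the extrapolation is a simple linear interpolation in task count (an assumption, but one that reproduces the quoted figure to within rounding):

$$
t_{\mathrm{HSW}}(24) \approx t(16) + \frac{24-16}{32-16}\,\bigl[t(32)-t(16)\bigr] = 17.2 + \tfrac{1}{2}\,(10.6-17.2) \approx 13.9~\mathrm{s}
$$

$$
\frac{t_{\mathrm{HSW}}(24)}{t_{\mathrm{KNL}}} = \frac{13.8}{28.6} \approx 0.48,
\qquad
\frac{t_{\mathrm{HSW}}(16)}{t_{\mathrm{KNL}}} = \frac{17.2}{28.6} \approx 0.60
$$

The lower ratio compares against the extrapolated full Haswell node, the upper against the measured 16-task single-node run.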
The increase in runtime when oversubscribing the cores seems to arise from MPI overhead (wait times of both Isends and Irecvs), so one-way hyperthreading is the best option. The performance difference between one Haswell node and one KNL seems to arise both from slower site updates, i.e. floating-point performance (roughly 10 µs per site update on KNL versus 6 µs on Haswell), and, to a lesser degree, from larger MPI wait times on KNL. In any case, the pure-MPI version of the code is clearly not optimal for KNL.
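As a rough illustration of how such wait times can be separated from the cost of posting the nonblocking calls, the sketch below times the MPI_Waitall phase of a generic exchange. The buffers, message size, and ring-of-neighbours pattern are hypothetical placeholders, not the actual communication structure of HELP.

```c
/* Minimal sketch (not HELP code): timing the wait phase of a generic
 * nonblocking exchange.  Message size and neighbour topology are
 * hypothetical placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 16;                 /* hypothetical halo size (doubles) */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sendbuf[i] = (double)rank;

    int up = (rank + 1) % size;            /* simple 1-D ring of neighbours */
    int down = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    double t0 = MPI_Wtime();
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ...computation to be overlapped with communication would go here... */
    double t1 = MPI_Wtime();
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double t2 = MPI_Wtime();

    printf("rank %d: post %.3g s, wait %.3g s\n", rank, t1 - t0, t2 - t1);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

A wait phase that is long relative to the work overlapped between posting and waiting indicates communication-bound behaviour, which is the pattern observed here when the cores are oversubscribed.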
The profile is quite flat, i.e. the execution time is spread over multiple routines. The four most time-consuming routines, each accounting for ~10% of the execution time, do not really employ AVX-512 vectorization, based on a rudimentary compiler feedback analysis and the floating-point performance figures above. Forcing the compiler to vectorize loops it considers unprofitable (-vec-threshold0) did not improve performance, so more intrusive changes are needed to utilize AVX-512 instructions properly.
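As a hypothetical sketch of the kind of revision this would take (not code from HELP), consider an SU(3) matrix-vector product, a typical lattice-QCD kernel: with one small structure per site the loops are too short to vectorize, whereas a structure-of-arrays layout over sites gives the compiler long unit-stride streams that AVX-512 can fill.

```c
/* Hypothetical illustration (not HELP code) of why data layout matters
 * for AVX-512: a 3x3 complex matrix times a 3-vector, applied at every
 * lattice site. */

typedef struct { double re, im; } cplx;

/* Site-by-site (array-of-structures) layout: the inner loops run over
 * only 3 elements, far too short to fill a 512-bit vector register, so
 * the compiler has little to vectorize. */
void mul_site(const cplx u[3][3], const cplx v[3], cplx w[3])
{
    for (int i = 0; i < 3; i++) {
        cplx s = { 0.0, 0.0 };
        for (int j = 0; j < 3; j++) {
            s.re += u[i][j].re * v[j].re - u[i][j].im * v[j].im;
            s.im += u[i][j].re * v[j].im + u[i][j].im * v[j].re;
        }
        w[i] = s;
    }
}

/* Structure-of-arrays layout: each matrix/vector component is stored
 * contiguously over all sites, so the innermost loop is a long,
 * unit-stride stream that maps directly onto AVX-512 (8 doubles per
 * zmm register). */
void mul_sites_soa(const double *u_re, const double *u_im, /* [9][n] */
                   const double *v_re, const double *v_im, /* [3][n] */
                   double *w_re, double *w_im,             /* [3][n] */
                   long n)
{
    for (int i = 0; i < 3; i++) {
        double *wre = w_re + i * n, *wim = w_im + i * n;
        for (long s = 0; s < n; s++) { wre[s] = 0.0; wim[s] = 0.0; }
        for (int j = 0; j < 3; j++) {
            const double *ure = u_re + (3 * i + j) * n;
            const double *uim = u_im + (3 * i + j) * n;
            const double *vre = v_re + j * n;
            const double *vim = v_im + j * n;
            for (long s = 0; s < n; s++) {  /* unit-stride: vectorizes */
                wre[s] += ure[s] * vre[s] - uim[s] * vim[s];
                wim[s] += ure[s] * vim[s] + uim[s] * vre[s];
            }
        }
    }
}
```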
The main prospects for Level-1 work lie in completing the threaded version. The developers do not have much ambition in this direction (they are instead working intensively on optimizing a multi-GPU version of the code), and the person who made the early attempts has already left the group. In addition, improving AVX-512 utilization would require revising many of the underlying data structures, along the lines sketched above.