
Optimization progress

This document tracks the progress made while optimizing the code. For now it shows results for local-only operation. All results below were gathered using this command line:

-Disableoutput -Problem=moving_star -Max_level=6 -Stopstep=1 -Xscale=32 \
    -Odt=0.5 -Stoptime=0.1 --hpx:threads=6

The results were gathered on a two-socket Nehalem system with 6 cores per socket. Note that this means all vector operations are limited to SSE/SSE2.
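
To put the SSE limitation in context: the profiles below refer to Vc's Vc_1::SimdArray<double, 8, ...> type, which presents 8 double lanes even though a native SSE vector holds only 2. A minimal sketch (assuming Vc 1.x, where size() is a static member) illustrates this:

    #include <Vc/Vc>
    #include <iostream>

    int main() {
        // On an SSE/SSE2-only machine a native Vc double vector has 2 lanes.
        std::cout << Vc::double_v::size() << '\n';             // prints 2 under SSE2

        // SimdArray<double, 8> always exposes 8 lanes; under SSE it is
        // composed of several 2-lane native vectors internally.
        std::cout << Vc::SimdArray<double, 8>::size() << '\n'; // prints 8
        return 0;
    }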

Here is the baseline data for commit 850bf4 (click on the image to see it full sized):

link

This clearly shows that the overall runtime is dominated by the following five functions:

Function
taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis
grid::compute_boundary_interactions_multipole_multipole
grid::compute_interactions
grid::compute_boundary_interactions_monopole_multipole
grid::compute_boundary_interactions_monopole_monopole

After applying some optimizations to the taylor loops and restructuring the code by lifting index computations out of inner loops, we get this (commit 3c24cd):

link

which is a clear improvement.
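
The kind of index lifting applied is sketched below with hypothetical names (A, B, c, and N are illustrative, not the actual Octotiger variables): the outer component of a flattened array index is loop-invariant, so it can be computed once per outer iteration instead of in every inner iteration.

    // Before: the full flattened index is recomputed in the innermost loop.
    for (int i = 0; i != N; ++i) {
        for (int j = 0; j != N; ++j) {
            A[i * N + j] += B[i * N + j] * c;   // i * N is redone N times per i
        }
    }

    // After: the loop-invariant part of the index is lifted out.
    for (int i = 0; i != N; ++i) {
        const int base = i * N;                 // computed once per outer iteration
        for (int j = 0; j != N; ++j) {
            A[base + j] += B[base + j] * c;
        }
    }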

The next step focused on reducing the impact of calling pow (which now appears among the top five functions). This mainly involved converting const variables and functions to constexpr and pre-calculating certain expressions. The result can be seen here (commit 432888):

link
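
To illustrate the technique (a sketch with made-up names, not the actual Octotiger code): std::pow is not constexpr, so a small fixed integer power can be rewritten as a constexpr helper and folded at compile time whenever its argument is a compile-time constant.

    #include <cmath>

    // A constexpr helper the compiler can evaluate at compile time.
    constexpr double cube(double x) { return x * x * x; }

    // Hypothetical compile-time constant (illustrative only).
    constexpr double dx = 0.5;

    // Before: a runtime libm call in a hot path.
    const double w_runtime = std::pow(dx, 3.0);

    // After: evaluated entirely at compile time; no pow call remains at runtime.
    constexpr double w_constexpr = cube(dx);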

Now we do the same for std::copysign. The result can be seen here (commit f846e7):

link
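
std::copysign is another libm call that shows up in hot loops; one standard-conforming replacement (a sketch, not necessarily the exact change that was made) is to transfer the sign bit directly:

    #include <cstdint>
    #include <cstring>

    // Copies the sign bit of sgn onto mag without calling into libm.
    // std::memcpy is the portable way to reinterpret the bit pattern.
    inline double copysign_bits(double mag, double sgn) {
        std::uint64_t m, s;
        std::memcpy(&m, &mag, sizeof m);
        std::memcpy(&s, &sgn, sizeof s);
        const std::uint64_t sign_mask = std::uint64_t(1) << 63;
        m = (m & ~sign_mask) | (s & sign_mask);
        double out;
        std::memcpy(&out, &m, sizeof out);
        return out;
    }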

All of the above changes have improved the overall runtime by almost 30%.

The next steps should focus on further optimizing the three remaining hotspot functions:

Top Hotspots

Function	CPU Time
taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis	174.351s
grid::compute_boundary_interactions_multipole_multipole	138.763s
grid::compute_interactions	105.589s

APEX measurement of Octotiger on KNL node at Oregon

The following figures show Octotiger running on one node of the KNL system at the University of Oregon (UO). HPX was configured with:

    cmake \
    -DCMAKE_CXX_COMPILER="icpc" \
    -DCMAKE_C_COMPILER="icc" \
    -DCMAKE_Fortran_COMPILER="ifort" \
    -DCMAKE_LINKER="xild" \
    -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_BUILD_TYPE=Release \
    -DHPX_WITH_MAX_CPU_COUNT=272 \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DCMAKE_INSTALL_PREFIX=${startdir}/install-knl \
    -DHPX_WITH_APEX=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    -DHPX_WITH_MALLOC=jemalloc \
    -DJEMALLOC_ROOT=${HOME}/install/jemalloc-3.5.1 \
    -DHWLOC_ROOT=${HOME}/install/hwloc-1.8 \
    -DHPX_WITH_TOOLS=ON \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake -DCMAKE_PREFIX_PATH=$HOME/src/tmp/build-knl \
        -DCMAKE_CXX_COMPILER="icpc" \
        -DCMAKE_C_COMPILER="icc" \
        -DCMAKE_AR="xiar" \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is a view of an OTF2 trace of Octotiger in Vampir 8.5: link

Here is an APEX concurrency view of the same execution: link