
Optimization progress

This document tracks the progress made while optimizing the code. For now it shows results for local-only operation. All results below were gathered using this command line:

-Disableoutput -Problem=moving_star -Max_level=6 -Stopstep=1 -Xscale=32 \
    -Odt=0.5 -Stoptime=0.1 --hpx:threads=6

The results were gathered on a two-socket Nehalem system with 6 cores per socket. Note that this means all vector operations are limited to SSE/SSE2.
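
To put the SSE limitation in context: the profiles below refer to Vc's Vc_1::SimdArray<double, 8, ...> type, which presents 8 double lanes even though a native SSE vector holds only 2. A minimal sketch (assuming Vc 1.x, where size() is a static member) illustrates this:

    #include <Vc/Vc>
    #include <iostream>

    int main() {
        // On an SSE/SSE2-only machine a native Vc double vector has 2 lanes.
        std::cout << Vc::double_v::size() << '\n';             // prints 2 under SSE2

        // SimdArray<double, 8> always exposes 8 lanes; under SSE it is
        // composed of several 2-lane native vectors internally.
        std::cout << Vc::SimdArray<double, 8>::size() << '\n'; // prints 8
        return 0;
    }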

Here is the baseline data for commit 850bf4 (click on the image to see it full sized):

link

This clearly shows that the overall runtime is dominated by the following five functions:

Function
taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis
grid::compute_boundary_interactions_multipole_multipole
grid::compute_interactions
grid::compute_boundary_interactions_monopole_multipole
grid::compute_boundary_interactions_monopole_monopole

After applying some optimizations to the taylor loops and restructuring the code by lifting index computations out of inner loops, we get this (commit 3c24cd):

link

which is a clear improvement.
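
The kind of index lifting applied is sketched below with hypothetical names (A, B, c, and N are illustrative, not the actual Octotiger variables): the outer component of a flattened array index is loop-invariant, so it can be computed once per outer iteration instead of in every inner iteration.

    // Before: the full flattened index is recomputed in the innermost loop.
    for (int i = 0; i != N; ++i) {
        for (int j = 0; j != N; ++j) {
            A[i * N + j] += B[i * N + j] * c;   // i * N is redone N times per i
        }
    }

    // After: the loop-invariant part of the index is lifted out.
    for (int i = 0; i != N; ++i) {
        const int base = i * N;                 // computed once per outer iteration
        for (int j = 0; j != N; ++j) {
            A[base + j] += B[base + j] * c;
        }
    }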

The next step focused on reducing the impact of calling pow (which now appears among the top five functions). This mainly involved converting const variables and functions to constexpr and pre-calculating certain expressions. The result can be seen here (commit 432888):

link
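
To illustrate the technique (a sketch with made-up names, not the actual Octotiger code): std::pow is not constexpr, so a small fixed integer power can be rewritten as a constexpr helper and folded at compile time whenever its argument is a compile-time constant.

    #include <cmath>

    // A constexpr helper the compiler can evaluate at compile time.
    constexpr double cube(double x) { return x * x * x; }

    // Hypothetical compile-time constant (illustrative only).
    constexpr double dx = 0.5;

    // Before: a runtime libm call in a hot path.
    const double w_runtime = std::pow(dx, 3.0);

    // After: evaluated entirely at compile time; no pow call remains at runtime.
    constexpr double w_constexpr = cube(dx);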

Now we do the same for std::copysign. The result can be seen here (commit f846e7):

link
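
std::copysign is another libm call that shows up in hot loops; one standard-conforming replacement (a sketch, not necessarily the exact change that was made) is to transfer the sign bit directly:

    #include <cstdint>
    #include <cstring>

    // Copies the sign bit of sgn onto mag without calling into libm.
    // std::memcpy is the portable way to reinterpret the bit pattern.
    inline double copysign_bits(double mag, double sgn) {
        std::uint64_t m, s;
        std::memcpy(&m, &mag, sizeof m);
        std::memcpy(&s, &sgn, sizeof s);
        const std::uint64_t sign_mask = std::uint64_t(1) << 63;
        m = (m & ~sign_mask) | (s & sign_mask);
        double out;
        std::memcpy(&out, &m, sizeof out);
        return out;
    }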

All of the above changes have improved the overall runtime by almost 30%.

The next steps should focus on further optimizing the three remaining hotspot functions:

Top Hotspots

Function	CPU Time
taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis	174.351s
grid::compute_boundary_interactions_multipole_multipole	138.763s
grid::compute_interactions	105.589s

APEX measurement of Octotiger on KNL node at Oregon

The following figures show Octotiger running on one node of the KNL system at the University of Oregon (UO). HPX was configured with:

    cmake \
    -DCMAKE_CXX_COMPILER="icpc" \
    -DCMAKE_C_COMPILER="icc" \
    -DCMAKE_Fortran_COMPILER="ifort" \
    -DCMAKE_LINKER="xild" \
    -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_BUILD_TYPE=Release \
    -DHPX_WITH_MAX_CPU_COUNT=272 \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DCMAKE_INSTALL_PREFIX=${startdir}/install-knl \
    -DHPX_WITH_APEX=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    -DHPX_WITH_MALLOC=jemalloc \
    -DJEMALLOC_ROOT=${HOME}/install/jemalloc-3.5.1 \
    -DHWLOC_ROOT=${HOME}/install/hwloc-1.8 \
    -DHPX_WITH_TOOLS=ON \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake -DCMAKE_PREFIX_PATH=$HOME/src/tmp/build-knl \
        -DCMAKE_CXX_COMPILER="icpc" \
        -DCMAKE_C_COMPILER="icc" \
        -DCMAKE_AR="xiar" \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is a view of an OTF2 trace of Octotiger in Vampir 8.5: link

Here is an APEX concurrency view of the same execution: link