
Optimization progress


This page documents the progress made while working on the code. For now it shows results for local-only operation. All results below were gathered using this command line:

-Disableoutput -Problem=moving_star -Max_level=6 -Stopstep=1 -Xscale=32 \
    -Odt=0.5 -Stoptime=0.1 --hpx:threads=6

The results were gathered on a 2-socket Nehalem system with 6 cores per socket. Note that this means all vector operations are limited to SSE/SSE2.

Here is the baseline data, commit 850bf4 (click on the image to see it full sized):

link

This clearly shows that the overall runtime is determined by 5 functions:

| Function | Module | CPU Time |
| --- | --- | --- |
| `taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis` | | |
| `grid::compute_boundary_interactions_multipole_multipole` | | |
| `grid::compute_interactions` | | |
| `grid::compute_boundary_interactions_monopole_multipole` | | |
| `grid::compute_boundary_interactions_monopole_monopole` | | |

After applying some optimizations to the taylor loops and restructuring the code by lifting index operations out of the loops, we get this (commit 3c24cd):

link

which is a clear improvement.
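
To illustrate the kind of restructuring involved, here is a minimal sketch of lifting index computations out of an inner loop. The function and array names are hypothetical and chosen only for illustration; they do not correspond to the actual taylor/grid code.

    // Hypothetical sketch of loop-invariant index hoisting (names and shapes do
    // not match the Octotiger code).
    void scale_rows(double* A, double const* B, double const* C, int N, int M) {
        // Before: A[i * M + j] += B[i * M + j] * C[j]; the i * M index math was
        // redone on every inner iteration, and the compiler cannot always hoist
        // it on its own when pointer aliasing is possible.
        for (int i = 0; i != N; ++i) {
            int const row = i * M;                  // invariant across the inner loop
            for (int j = 0; j != M; ++j) {
                A[row + j] += B[row + j] * C[j];    // only j-dependent work remains
            }
        }
    }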

The next step focused on reducing the impact of calling pow (which is now among the top 5 functions). This mainly involved converting const variables and functions to constexpr and pre-calculating certain expressions. The result can be seen here (commit 432888):

link
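
As a rough illustration of this kind of change, a runtime pow call on compile-time-known values can be replaced by a precomputed constexpr expression. The variable names and values below are hypothetical and not taken from the Octotiger source.

    // Before: dx is only const, so std::pow(dx, 3) was a real libm call in the
    // hot path (hypothetical example):
    //   double const dx = 1.0 / 64.0;
    //   double cell_volume() { return std::pow(dx, 3); }

    // After: make the variable constexpr and spell out the small integer power,
    // so the whole expression folds at compile time (std::pow is not constexpr).
    constexpr double dx = 1.0 / 64.0;
    constexpr double cell_volume = dx * dx * dx;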

Now we do the same for std::copysign. The result can be seen here (commit f846e7):

link
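
The copysign change follows the same idea. One common way to avoid the library call in a hot loop is a small inline replacement (or, in the Vc code paths, a masked select over the SIMD lanes); the sketch below is hypothetical and is not the actual diff.

    // Hypothetical inline stand-in for std::copysign(mag, sgn) on doubles.
    // Note: it treats the signs of -0.0 and NaN differently from std::copysign,
    // which is fine when sgn is known to be an ordinary finite value.
    inline double copysign_like(double mag, double sgn) {
        double const a = mag < 0.0 ? -mag : mag;   // |mag|
        return sgn < 0.0 ? -a : a;                 // apply the sign of sgn
    }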

All of the above changes have improved the overall runtime by almost 30%.

The next steps should focus on further optimizing the three remaining hotspot functions:

Top Hotspots

| Function | CPU Time |
| --- | --- |
| `taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis` | 174.351s |
| `grid::compute_boundary_interactions_multipole_multipole` | 138.763s |
| `grid::compute_interactions` | 105.589s |

APEX measurement of Octotiger on KNL node at Oregon

The following figures show Octotiger running on one node of the KNL system at UO. HPX was configured with:

    cmake \
    -DCMAKE_CXX_COMPILER="icpc" \
    -DCMAKE_C_COMPILER="icc" \
    -DCMAKE_Fortran_COMPILER="ifort" \
    -DCMAKE_LINKER="xild" \
    -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_BUILD_TYPE=Release \
    -DHPX_WITH_MAX_CPU_COUNT=272 \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DCMAKE_INSTALL_PREFIX=${startdir}/install-knl \
    -DHPX_WITH_APEX=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    -DHPX_WITH_MALLOC=jemalloc \
    -DJEMALLOC_ROOT=${HOME}/install/jemalloc-3.5.1 \
    -DHWLOC_ROOT=${HOME}/install/hwloc-1.8 \
    -DHPX_WITH_TOOLS=ON \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake -DCMAKE_PREFIX_PATH=$HOME/src/tmp/build-knl \
        -DCMAKE_CXX_COMPILER="icpc" \
        -DCMAKE_C_COMPILER="icc" \
        -DCMAKE_AR="xiar" \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is a view of an OTF2 trace of Octotiger in Vampir 8.5: link

Here is a zoomed view (~10ms) of the same OTF2 trace in Vampir 8.5: link

Here is the APEX concurrency view of the same execution (sampled every 1 second): link

Several questions/observations about this execution:

  • I requested two iterations. Why does it look like four? There appears to be a period halfway through an iteration where concurrency drops to near zero.
  • One worker (#27) appears to do nothing. That is actually the APEX background task, updating the APEX profile - none of that work is measured by APEX, hence the apparent "idle" thread. That's been fixed. (see below)
  • The overall concurrency is poor - the hardware is less than half utilized on average.
  • Zooming in on the trace in Vampir makes the poor concurrency obvious.

Update Dec. 8, 2016

The build process was streamlined and improved. I created a KNL toolchain file for HPX that provides the settings for building on "normal" socket-based KNLs.

The new toolchain file:

# Copyright (c) 2016 Kevin Huck
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#
# This is the default toolchain file to be used with Intel Xeon KNLs. It sets
# the appropriate compile flags and compiler such that HPX will compile.
# Note that you still need to provide Boost, hwloc and other utility libraries
# like a custom allocator yourself.
#

# Set the Intel compilers
set(CMAKE_CXX_COMPILER icpc)
set(CMAKE_C_COMPILER icc)
set(CMAKE_Fortran_COMPILER ifort)
set(CMAKE_C_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_CXX_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_Fortran_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_Fortran_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_Fortran_FLAGS "-shared" CACHE STRING "")
set(HPX_WITH_PARCELPORT_TCP ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI_MULTITHREADED OFF CACHE BOOL "")

# We default to system as our allocator on the KNL
if(NOT DEFINED HPX_WITH_MALLOC)
  set(HPX_WITH_MALLOC "system" CACHE STRING "")
endif()

# Set the TBBMALLOC_PLATFORM correctly so that find_package(TBBMalloc) sets the
# right hints
set(TBBMALLOC_PLATFORM "mic-knl" CACHE STRING "")

# We have a bunch of cores on the MIC ... increase the default
set(HPX_WITH_MAX_CPU_COUNT "512" CACHE STRING "")

# RDTSC is available on Xeon/Phis
set(HPX_WITH_RDTSC ON CACHE BOOL "")

The new HPX config:

    cmake \
    -DCMAKE_TOOLCHAIN_FILE=$HOME/src/hpx-lsu/cmake/toolchains/KNL.cmake \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DHPX_WITH_DATAPAR_VC=On -DVc_ROOT=$HOME/src/operation_gordon_bell/Vc-icc \
    -DCMAKE_INSTALL_PREFIX=. \
    -DHPX_WITH_MALLOC=tcmalloc \
    -DTCMALLOC_ROOT=${startdir}/gperftools \
    -DHWLOC_ROOT=${startdir}/hwloc \
    -DHPX_WITH_APEX=TRUE \
    -DHPX_WITH_APEX_NO_UPDATE=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake \
        -DCMAKE_PREFIX_PATH=$HOME/src/operation_gordon_bell/build-knl \
        -DCMAKE_BUILD_TYPE=Release \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is an updated view of the same problem, with the current code on Dec. 8, 2016: link

Here is a comparison view using different numbers of hyperthreads - performance does not improve significantly in any case: link

Observations:

  • The concurrency is better, but not good. It maxes out at 80% and is typically around 60%.
  • The APEX overheads have been removed.
  • System behavior seems to account for the drops in concurrency. I (Kevin) suspect that there is contention between the threads for a constrained resource.
  • Next step: try running Octotiger in ThreadSpotter to determine the inefficiencies.