
Optimization progress


This page documents the progress made while working on the code. For now it shows results for local-only operation. All results below were gathered using this command line:

-Disableoutput -Problem=moving_star -Max_level=6 -Stopstep=1 -Xscale=32 \
    -Odt=0.5 -Stoptime=0.1 --hpx:threads=6

The results were gathered on a 2-socket Nehalem system with 6 cores per socket. Note that this means all vector operations are limited to SSE/SSE2.

Here is the baseline data, commit 850bf4 (click on the image to see it full sized):

link

This clearly shows that the overall runtime is determined by 5 functions:

| Function | Module | CPU Time |
| --- | --- | --- |
| `taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis` | | |
| `grid::compute_boundary_interactions_multipole_multipole` | | |
| `grid::compute_interactions` | | |
| `grid::compute_boundary_interactions_monopole_multipole` | | |
| `grid::compute_boundary_interactions_monopole_monopole` | | |

After applying some optimizations to the taylor loops and restructuring the code by lifting index operations out of the loops, we get this (commit 3c24cd):

link

which is a clear improvement.
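
To illustrate the kind of restructuring involved, here is a minimal sketch of lifting index computations out of an inner loop. The function and array names are hypothetical and chosen only for illustration; they do not correspond to the actual taylor/grid code.

    // Hypothetical sketch of loop-invariant index hoisting (names and shapes do
    // not match the Octotiger code).
    void scale_rows(double* A, double const* B, double const* C, int N, int M) {
        // Before: A[i * M + j] += B[i * M + j] * C[j]; the i * M index math was
        // redone on every inner iteration, and the compiler cannot always hoist
        // it on its own when pointer aliasing is possible.
        for (int i = 0; i != N; ++i) {
            int const row = i * M;                  // invariant across the inner loop
            for (int j = 0; j != M; ++j) {
                A[row + j] += B[row + j] * C[j];    // only j-dependent work remains
            }
        }
    }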

The next step focused on reducing the impact of calling pow (which is now among the top 5 functions). This mainly involved converting const variables and functions to constexpr and pre-calculating certain expressions. The result can be seen here (commit 432888):

link
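
As a rough illustration of this kind of change, a runtime pow call on compile-time-known values can be replaced by a precomputed constexpr expression. The variable names and values below are hypothetical and not taken from the Octotiger source.

    // Before: dx is only const, so std::pow(dx, 3) was a real libm call in the
    // hot path (hypothetical example):
    //   double const dx = 1.0 / 64.0;
    //   double cell_volume() { return std::pow(dx, 3); }

    // After: make the variable constexpr and spell out the small integer power,
    // so the whole expression folds at compile time (std::pow is not constexpr).
    constexpr double dx = 1.0 / 64.0;
    constexpr double cell_volume = dx * dx * dx;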

Now we do the same for std::copysign. The result can be seen here (commit f846e7):

link
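
The copysign change follows the same idea. One common way to avoid the library call in a hot loop is a small inline replacement (or, in the Vc code paths, a masked select over the SIMD lanes); the sketch below is hypothetical and is not the actual diff.

    // Hypothetical inline stand-in for std::copysign(mag, sgn) on doubles.
    // Note: it treats the signs of -0.0 and NaN differently from std::copysign,
    // which is fine when sgn is known to be an ordinary finite value.
    inline double copysign_like(double mag, double sgn) {
        double const a = mag < 0.0 ? -mag : mag;   // |mag|
        return sgn < 0.0 ? -a : a;                 // apply the sign of sgn
    }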

All of the above changes have improved the overall runtime by almost 30%.

The next steps should focus on further optimizing the three remaining hotspot functions:

Top Hotspots

| Function | CPU Time |
| --- | --- |
| `taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis` | 174.351s |
| `grid::compute_boundary_interactions_multipole_multipole` | 138.763s |
| `grid::compute_interactions` | 105.589s |

APEX measurement of Octotiger on KNL node at Oregon

The following figures show Octotiger running on one node of the KNL system at UO. HPX was configured with:

    cmake \
    -DCMAKE_CXX_COMPILER="icpc" \
    -DCMAKE_C_COMPILER="icc" \
    -DCMAKE_Fortran_COMPILER="ifort" \
    -DCMAKE_LINKER="xild" \
    -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
    -DCMAKE_BUILD_TYPE=Release \
    -DHPX_WITH_MAX_CPU_COUNT=272 \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DCMAKE_INSTALL_PREFIX=${startdir}/install-knl \
    -DHPX_WITH_APEX=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    -DHPX_WITH_MALLOC=jemalloc \
    -DJEMALLOC_ROOT=${HOME}/install/jemalloc-3.5.1 \
    -DHWLOC_ROOT=${HOME}/install/hwloc-1.8 \
    -DHPX_WITH_TOOLS=ON \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake -DCMAKE_PREFIX_PATH=$HOME/src/tmp/build-knl \
        -DCMAKE_CXX_COMPILER="icpc" \
        -DCMAKE_C_COMPILER="icc" \
        -DCMAKE_AR="xiar" \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is a view of an OTF2 trace of Octotiger in Vampir 8.5: link

Here is a zoomed view (~10ms) of the same OTF2 trace in Vampir 8.5: link

Here is the APEX concurrency view of the same execution (sampled every 1 second): link

Several questions/observations about this execution:

  • I requested two iterations. Why does it look like four? There appears to be a period halfway through an iteration where concurrency drops to near zero.
  • One worker (#27) appears to do nothing. That is actually the APEX background task, updating the APEX profile - none of that work is measured by APEX, hence the apparent "idle" thread. That's been fixed. (see below)
  • The overall concurrency is poor - the hardware is less than half utilized on average.
  • Zooming in on the trace in Vampir makes the poor concurrency obvious.

Update Dec. 8, 2016

The build process was streamlined and improved. I created a KNL toolchain file for HPX that provides the settings for building on "normal" socket-based KNLs.

The new toolchain file:

# Copyright (c) 2016 Kevin Huck
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#
# This is the default toolchain file to be used with Intel Xeon KNLs. It sets
# the appropriate compile flags and compiler such that HPX will compile.
# Note that you still need to provide Boost, hwloc and other utility libraries
# like a custom allocator yourself.
#

# Set the Intel compilers
set(CMAKE_CXX_COMPILER icpc)
set(CMAKE_C_COMPILER icc)
set(CMAKE_Fortran_COMPILER ifort)
set(CMAKE_C_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_CXX_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_Fortran_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_Fortran_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_Fortran_FLAGS "-shared" CACHE STRING "")
set(HPX_WITH_PARCELPORT_TCP ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI_MULTITHREADED OFF CACHE BOOL "")

# We default to system as our allocator on the KNL
if(NOT DEFINED HPX_WITH_MALLOC)
  set(HPX_WITH_MALLOC "system" CACHE STRING "")
endif()

# Set the TBBMALLOC_PLATFORM correctly so that find_package(TBBMalloc) sets the
# right hints
set(TBBMALLOC_PLATFORM "mic-knl" CACHE STRING "")

# We have a bunch of cores on the MIC ... increase the default
set(HPX_WITH_MAX_CPU_COUNT "512" CACHE STRING "")

# RDTSC is available on Xeon/Phis
set(HPX_WITH_RDTSC ON CACHE BOOL "")

The new HPX config:

    cmake \
    -DCMAKE_TOOLCHAIN_FILE=$HOME/src/hpx-lsu/cmake/toolchains/KNL.cmake \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
    -DHPX_WITH_DATAPAR_VC=On -DVc_ROOT=$HOME/src/operation_gordon_bell/Vc-icc \
    -DCMAKE_INSTALL_PREFIX=. \
    -DHPX_WITH_MALLOC=tcmalloc \
    -DTCMALLOC_ROOT=${startdir}/gperftools \
    -DHWLOC_ROOT=${startdir}/hwloc \
    -DHPX_WITH_APEX=TRUE \
    -DHPX_WITH_APEX_NO_UPDATE=TRUE \
    -DAPEX_WITH_ACTIVEHARMONY=TRUE \
    -DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
    -DAPEX_WITH_OTF2=TRUE \
    -DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
    ${HOME}/src/hpx-lsu

Octotiger was configured with:

    cmake \
        -DCMAKE_PREFIX_PATH=$HOME/src/operation_gordon_bell/build-knl \
        -DCMAKE_BUILD_TYPE=Release \
        -DOCTOTIGER_WITH_SILO=OFF \
        $HOME/src/octotiger

Octotiger was executed with:

-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68

Here is an updated view of the same problem, with the current code on Dec. 8, 2016: link

Here is a comparison view using different numbers of hyperthreads - performance does not improve significantly in any case: link

Observations:

  • The concurrency is better, but not good. It maxes out at 80% and is typically around 60%.
  • The APEX overheads have been removed.
  • System behavior seems to account for the drops in concurrency. I (Kevin) suspect that there is contention between the threads for a constrained resource.
  • Next step: try running Octotiger in ThreadSpotter to determine the inefficiencies.