mpi: do node-exclusive scheduling for cray-pals PMI
Problem: as documented in the "CORAL2: Flux on Cray Shasta" page
in the flux docs, two flux subinstances sharing the same nodes
can fail due to overlapping port numbers. For some reason, ever
since the vcpu test was added, this has been happening more often.

The solution is to do node-exclusive scheduling at the top level
so the jobs run sequentially.
wihobbs committed Mar 6, 2024
1 parent 6fd80ab commit 9eae124
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion mpi/outer_script.sh
@@ -35,7 +35,7 @@ COMPILERS="${LCSCHEDCLUSTER}_COMPILERS"
 for mpi in ${!MPIS}; do
 for compiler in ${!COMPILERS}; do
 if [[ $mpi == "cray-mpich" ]]; then
-EXTRA_FLUX_SUBMIT_OPTIONS="-o pmi=cray-pals" flux batch -N2 -n4 --flags=waitable --output=kvs $MPI_TESTS_DIRECTORY/inner_script.sh $compiler $mpi
+EXTRA_FLUX_SUBMIT_OPTIONS="-o pmi=cray-pals" flux batch --exclusive -N2 --flags=waitable --output=kvs $MPI_TESTS_DIRECTORY/inner_script.sh $compiler $mpi
 elif [[ $mpi == "openmpi"* ]]; then
 EXTRA_FLUX_SUBMIT_OPTIONS="-o pmi=pmix" flux batch -N2 -n4 --flags=waitable --output=kvs $MPI_TESTS_DIRECTORY/inner_script.sh $compiler $mpi
 else
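To make the difference between the two submission forms easier to see, here is a minimal dry-run sketch. It only prints the `flux batch` options each branch would pass and submits nothing. The helper `batch_opts_for_mpi` and the MPI name `openmpi-4` are hypothetical, introduced for illustration; they are not part of `outer_script.sh`.

```shell
# Hypothetical helper (not in outer_script.sh): print the flux batch options
# used for a given MPI name, without submitting any job.
batch_opts_for_mpi() {
    case "$1" in
        cray-mpich)
            # Whole-node allocation: no second cray-pals subinstance
            # can be scheduled onto the same nodes.
            echo "--exclusive -N2 --flags=waitable --output=kvs"
            ;;
        openmpi*)
            # Unchanged by this commit: 4 tasks across 2 nodes, nodes may be shared.
            echo "-N2 -n4 --flags=waitable --output=kvs"
            ;;
    esac
}

batch_opts_for_mpi cray-mpich
batch_opts_for_mpi openmpi-4   # placeholder name matching the "openmpi"* pattern
```

Only the `cray-mpich` branch gains `--exclusive` (and drops `-n4`); the `openmpi` branch keeps its original request.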
