PMIx support for slurm #24

Open · christopheredsall opened this issue Apr 4, 2019 · 2 comments
Labels: enhancement (New feature or request)

Comments

@christopheredsall (Contributor) commented Apr 4, 2019:

The current build of Slurm is:

$ rpm -q slurm
slurm-17.11.13-1.el7.x86_64

It has support for the following MPI types:

$ srun --mpi=list
srun: MPI types are...
srun: openmpi
srun: none
srun: pmi2
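
(For reference, one way to cross-check which MPI plugin Slurm is configured to use by default; this is a generic diagnostic, not output captured from this cluster:)

$ scontrol show config | grep MpiDefault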

Running an OpenMPI 3 job on the management node works:

$ module load mpi/openmpi3-x86_64 
$ wget https://raw.githubusercontent.com/wesleykendall/mpitutorial/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c
$ CC=mpicc make mpi_hello_world
mpicc     mpi_hello_world.c   -o mpi_hello_world
$ mpirun -np 2 ./mpi_hello_world
Hello world from processor mgmt, rank 0 out of 2 processors
Hello world from processor mgmt, rank 1 out of 2 processors

But running it from inside a Slurm job gives:

start
0: --------------------------------------------------------------------------
0: The application appears to have been direct launched using "srun",
0: but OMPI was not built with SLURM's PMI support and therefore cannot
0: execute. There are several options for building PMI support under
0: SLURM, depending upon the SLURM version you are using:
0: 
0:   version 16.05 or later: you can use SLURM's PMIx support. This
0:   requires that you configure and build SLURM --with-pmix.
0: 
0:   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
0:   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
0:   install PMI-2. You must then build Open MPI using --with-pmi pointing
0:   to the SLURM PMI library location.
0: 
0: Please configure as appropriate and try again.
0: --------------------------------------------------------------------------
0: *** An error occurred in MPI_Init
0: *** on a NULL communicator
0: *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
0: ***    and potentially your MPI job)
0: [vm-standard2-2-ad1-0001:18312] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
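
For context, the two build-time routes that message describes would look roughly like this (a sketch only; the PMIx and PMI install paths are illustrative assumptions, not taken from this deployment):

# Option 1 (Slurm >= 16.05): rebuild Slurm against an existing PMIx install
$ ./configure --with-pmix=/usr/local/pmix && make && make install

# Option 2: rebuild Open MPI against Slurm's PMI headers/library instead
$ ./configure --with-pmi=/usr && make && make install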
christopheredsall added the enhancement (New feature or request) label on Apr 4, 2019
@milliams (Member) commented Apr 18, 2019:

This will be fixed by #27

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
srun: pmix
srun: pmix_v3
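
With pmix in that list, an OpenMPI binary can then be direct-launched by selecting the plugin explicitly (a minimal sketch, reusing the hello-world binary from the report above):

$ srun -n 2 --mpi=pmix ./mpi_hello_world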

@ghost commented Jan 21, 2021:

This appears to have reared its head again. On a fresh install (following the tutorial) on GCP:

[citc@mgmt-social-marlin ~]$ srun -n 2 --pty /bin/bash
[citc@social-marlin-n1-standard-4-0001 ~]$ rpm -q slurm
slurm-20.02.5-1.7.x86_64
[citc@social-marlin-n1-standard-4-0001 ~]$ rpm -q openmpi
openmpi-4.0.3-3.el8.x86_64
[citc@social-marlin-n1-standard-4-0001 ~]$ /usr/lib64/openmpi/bin/mpirun hostname
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-plugins
But I couldn't open the help file:
    /usr/share/pmix/help-pmix-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
[citc@social-marlin-n1-standard-4-0001 ~]$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
[citc@social-marlin-n1-standard-4-0001 ~]$

Using MPICH's mpirun and srun works.
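
(Presumably via something like the following; the exact invocation isn't shown in the report, so this is an assumption based on the EL8 MPICH packaging:)

$ module load mpi/mpich-x86_64
$ mpicc mpi_hello_world.c -o mpi_hello_world
$ srun -n 2 --mpi=pmi2 ./mpi_hello_world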

If an MPI application is compiled with OpenMPI, it will not run via either mpirun or srun:

[citc@social-marlin-n1-standard-4-0001 ~]$ srun -n1  ./a.out
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-plugins
But I couldn't open the help file:
    /usr/share/pmix/help-pmix-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
[social-marlin-n1-standard-4-0001:09172] OPAL ERROR: Out of resource in file ext2x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[social-marlin-n1-standard-4-0001:09172] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: social-marlin-n1-standard-4-0001: task 0: Exited with exit code 1
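
Two generic checks that may help diagnose the missing help file (neither was run on this cluster, and the pmix package name is an assumption for EL8):

$ ompi_info | grep -i pmix   # which PMIx component Open MPI was built with
$ rpm -ql pmix | grep help   # where the installed PMIx package puts its help files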
