Problem calling Underworld container in a Slurm Environment
I've encountered a problem calling Underworld inside a Singularity container in a Slurm environment. This applies only to recent versions of Underworld, probably due to a newer version of MPI being installed. Previously people invoked Underworld on our system using shell scripts that expand to something like:
singularity exec underworld.simg mpirun -np 2 python script.py
This worked on previous versions of UW, but on the latest (underworld/2.12.2b) we get an error:
[mpiexec@mk05] HYDU_create_process (utils/launch/launch.c:73): execvp error on file srun (No such file or directory)
To cut a long story short, the MPI version inside the container recognizes the presence of SLURM environment variables and tries to call srun, which is not in the container. We cannot insert the path of srun (from outside the container) into the container, as our srun is installed in /opt, and this file system is overlaid by the Singularity container.
We can get around this problem by unsetting all SLURM environment variables, i.e.
# remove SLURM environment variables
for i in $(printenv | grep ^SLURM); do unset "$(echo "$i" | cut -d '=' -f1)"; done
and then mpirun in the container works normally as expected.
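A variant of the same workaround, assuming a POSIX shell, is to strip the SLURM_* variables only from the container's environment via env -u, leaving the login shell untouched. The singularity line is the invocation from above, commented out here since it needs the HPC system:

```shell
# Build a list of "-u VAR" flags, one per SLURM_* variable currently set.
unset_flags=$(printenv | awk -F= '/^SLURM_/ {print "-u", $1}')

# Run the container with those variables removed from the child
# environment only, e.g.:
#   env $unset_flags singularity exec underworld.simg mpirun -np 2 python script.py

# Demonstration without singularity: count SLURM_* vars the child sees.
env $unset_flags sh -c 'printenv | grep ^SLURM_ | wc -l'   # prints 0
```

The parent shell keeps its SLURM variables, so later sbatch/squeue commands in the same job script are unaffected.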
This is not an issue with UW, but an artifact of how Slurm, Singularity, and MPI interact. I am not sure if there is a more elegant fix, but I would like to raise the issue for comment.
regards, Simon
Hi Simon, much appreciated for the detailed feedback on this. Yeah, it's suboptimal! I'm happy to look at the Docker MPI environment from the Underworld end and investigate possible improvements. Do you have any details on the system you're running? cheers, Julian
Julian, we took the Docker container and converted it to Singularity, as Docker is not supported on our HPC systems. Monarch/M3 currently use CentOS 7 but aim to move to Ubuntu in the future.
There are a few possible approaches, IMHO:
- Insert Slurm into the container. The problem is that the version of Slurm in the container may not be compatible with the one in the host OS. Slurm is only backward compatible with the two previous versions, so this will eventually fall out of date.
- Invoke mpirun with a flag telling it NOT to use Slurm. I am not sure what that flag might be.
- Work out by trial and error which SLURM environment variable causes mpirun to use Slurm, and unset only that one (still not the best option).
- Perhaps don't have an /opt in the container? This may break something else, and whatever you do, you may find people install Slurm into different locations, including system paths that the container would overlay.
Simon
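On the second point above: the HYDU_ prefix in the error message suggests MPICH's Hydra process manager, which accepts a -launcher flag (and an equivalent HYDRA_LAUNCHER environment variable) to override the auto-detected launcher. A sketch, with flag and variable names taken from the MPICH/Hydra documentation and untested against this particular image:

```shell
# Tell Hydra to fork processes locally instead of detecting Slurm and
# calling srun. Names are from the MPICH/Hydra docs; untested here.
export HYDRA_LAUNCHER=fork

# Command-line equivalent of the environment variable above:
#   singularity exec underworld.simg mpirun -launcher fork -np 2 python script.py

echo "HYDRA_LAUNCHER=$HYDRA_LAUNCHER"
```

If it works, this would be cleaner than unsetting the SLURM variables, since the job's Slurm environment stays intact for everything except Hydra's launcher choice.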