Add '--pid' to the singularity command
I noticed that sometimes, if the application run is interrupted (e.g. Ctrl-C'd), it leaves behind some of its processes (mpiexec.hydra, python, etc.). I discovered this while playing with a Singularity image of the NGC GAMESS container, but it is definitely not limited to it.
Here's a simple reproduction using the RAPIDS AI interactive example:
$ module use ngc-container-environment-modules
$ module load rapidsai/0.17
$ jupyter notebook --ip 0.0.0.0 --no-browser --notebook-dir /rapids/notebooks
.... Jupyter starts ....
.... I can open the browser, use the notebook, everything's great ....
Now if I hit Ctrl-C, everything shuts down as expected and I get my prompt back. But there are ghosts left behind:
$ ps uxww | grep '[c]onda'
lev 190040 0.5 0.0 2675596 86400 pts/105 S 20:09 0:02 /opt/conda/envs/rapids/bin/python3.7 /opt/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=
Changing the container_launch definition to be ..... run --nv --pid ......
fixes the problem and eliminates the ghost processes. Our Singularity is 3.6.4; not sure how this plays out on other versions.
Can you please submit an issue for this in the Singularity GitHub? I don't think --pid should be required to clean up after a SIGINT.
--pid can also have some undesired side effects; for instance, it breaks NCCL, which is used by the DL containers. So I'd rather see this fixed in Singularity than add this workaround here.
Sure, submitted.
Scott, please see this comment: https://github.com/hpcng/singularity/issues/5884#issuecomment-803176024
Looks like --pid is indeed needed because of the way the container starts jupyter-lab with nohup ... &.
Still leaves a question of why I saw mpiexec.hydra's... Might be worth adding both TINI_SUBREAPER=1 and TINI_KILL_PROCESS_GROUP=1 to the modules for all containers that use tini.
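For reference, a minimal sketch of what setting those two variables could look like, assuming the module can export environment variables and relying on Singularity's SINGULARITYENV_ prefix to pass them into the container. Where exactly this would go in each module is an assumption, and it is untested against the mpiexec.hydra case:

```shell
# Sketch only: variables prefixed with SINGULARITYENV_ are injected into the
# container environment with the prefix stripped, so tini inside the
# container would see TINI_SUBREAPER and TINI_KILL_PROCESS_GROUP.

# Ask tini to register as a child subreaper so orphaned descendants are
# re-parented to it and reaped.
export SINGULARITYENV_TINI_SUBREAPER=1

# Ask tini to forward signals to its child's whole process group rather than
# to the direct child only.
export SINGULARITYENV_TINI_KILL_PROCESS_GROUP=1
```

This only has an effect in containers whose entrypoint actually runs tini.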
To my knowledge, only the Rapids container uses tini. It seems like --pid may be appropriate there (although I'm still concerned about NCCL), but I'm not sure it should be applied globally.
That's fair. I played with a couple of other containers and they don't seem to be affected. The one exception is the GAMESS-17 container, but a) it's a bit of a problem child, and b) that is why I have kept '--pid' there.