ngc-container-environment-modules

Add '--pid' to the singularity command

Open lgorenstein opened this issue 4 years ago • 5 comments

I noticed that sometimes, if the application run is interrupted (e.g. Ctrl-C'd), it leaves behind some of its processes (mpiexec.hydra, python, etc.). I discovered this while playing with a Singularity image of the NGC GAMESS container, but it is definitely not limited to it.

Here's a simple reproduction using the RAPIDS AI interactive example:

$ module use ngc-container-environment-modules
$ module load rapidsai/0.17
$ jupyter notebook --ip 0.0.0.0 --no-browser --notebook-dir /rapids/notebooks
   .... Jupyter starts ....
   .... I can open the browser, use the notebook, everything's great ....

Now if I hit Ctrl-C, everything shuts down as expected and I get my prompt back. But there are ghosts left behind:

$ ps uxww | grep '[c]onda'
lev      190040  0.5  0.0 2675596 86400 pts/105 S    20:09   0:02 /opt/conda/envs/rapids/bin/python3.7 /opt/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=

Changing the container_launch definition to be ..... run --nv --pid ...... fixes the problem and eliminates the ghost processes. Our Singularity is 3.6.4; I'm not sure how this plays out on other versions.
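
To make the change concrete, here is a minimal sketch of a launch command with and without --pid. The image path and the launch_cmd helper are hypothetical and only for illustration; the real container_launch definition lives in the modulefile and is elided above.

```shell
#!/bin/sh
# Hypothetical image path; the real one is set by the modulefile.
IMAGE="${IMAGE:-/path/to/rapidsai.sif}"

# Print the launch command, optionally with a private PID namespace.
# Usage: launch_cmd yes|no   (hypothetical helper, not part of the modules)
launch_cmd() {
    if [ "$1" = "yes" ]; then
        # --pid runs the container in its own PID namespace
        echo "singularity run --nv --pid $IMAGE"
    else
        echo "singularity run --nv $IMAGE"
    fi
}

launch_cmd no
launch_cmd yes
```

Keeping --pid behind a switch like this would let it be enabled only for the containers that need it.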

Mar 08 '21 01:03 lgorenstein

Can you please submit an issue for this in the Singularity GitHub? I don't think --pid should be required to clean up after a SIGINT.

--pid can also have some undesired side effects; for instance, it breaks NCCL, which is used by the DL containers. So I'd rather see this fixed in Singularity than add this workaround here.

Mar 19 '21 20:03 samcmill

Sure, submitted.

Mar 19 '21 21:03 lgorenstein

Scott, please see this comment: https://github.com/hpcng/singularity/issues/5884#issuecomment-803176024

It looks like --pid is indeed needed because of the way the container starts jupyter-lab with nohup ... &.
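
The effect of nohup ... & can be reproduced without any container. In this minimal sketch, an inner sh stands in for the container's startup script and sleep stands in for jupyter-lab; both are assumptions for illustration, not the actual RAPIDS entrypoint.

```shell
#!/bin/sh
# Temporary file used to pass the child's PID back to this script.
pidfile=$(mktemp)

# The inner shell plays the role of the startup script: it backgrounds a
# long-running child with nohup ... & and then exits immediately.
sh -c 'nohup sleep 60 >/dev/null 2>&1 & echo $! > "$1"' _ "$pidfile"

child=$(cat "$pidfile")

# The inner shell is gone, but its backgrounded child is still running.
if kill -0 "$child" 2>/dev/null; then
    survived=yes
else
    survived=no
fi
echo "child $child survived its launcher: $survived"

# Clean up the stand-in process and the temporary file.
kill "$child" 2>/dev/null
rm -f "$pidfile"
```

Without --pid, a child like this is simply reparented on the host once its launcher exits. Inside a private PID namespace it would be killed when the namespace's init process exits, which is presumably why --pid removes the ghosts.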

That still leaves the question of why I saw leftover mpiexec.hydra processes... It might be worth adding both TINI_SUBREAPER=1 and TINI_KILL_PROCESS_GROUP=1 to the modules for all containers that use tini.
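
As a rough sketch of that idea from the shell side: Singularity forwards variables named SINGULARITYENV_<NAME> into the container as <NAME>, so both tini settings could be tried before launching. Whether they take effect depends on the tini version inside the image, and a modulefile would set them through its own mechanism rather than export; this is only a quick way to experiment.

```shell
#!/bin/sh
# Forwarded into the container as TINI_SUBREAPER and TINI_KILL_PROCESS_GROUP.
export SINGULARITYENV_TINI_SUBREAPER=1
export SINGULARITYENV_TINI_KILL_PROCESS_GROUP=1

# Show what will be passed through to the container.
env | grep '^SINGULARITYENV_TINI_' | sort
```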

Mar 19 '21 23:03 lgorenstein

To my knowledge, only the RAPIDS container uses tini. It seems like --pid may be appropriate there (although I'm still concerned about NCCL), but I'm not sure it should be applied globally.

Mar 26 '21 21:03 samcmill

That's fair. I played with a couple of other containers and they don't seem to be affected. The one exception is the GAMESS-17 container, but a) it's a bit of a problem child, and b) that is why I have kept '--pid' there.

Mar 26 '21 22:03 lgorenstein