deepops

Fetching PIDs for cleanup of timed-out jobs sometimes fails to kill processes

Open ilya-da opened this issue 1 year ago • 1 comments

Under some circumstances the Slurm epilog fails to clean up processes because of how it parses the output of nvidia-smi pmon.

From /var/log/slurm/prolog-epilog

  • for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v -)
  • logger -s -t slurm-epilog 'Killing residual GPU process Idx ...'
    <13>Sep 10 15:12:33 slurm-epilog: Killing residual GPU process Idx ...
  • kill -9 Idx                    <---- this is not a valid PID
    /etc/slurm/epilog.d/50-exclusive-gpu: line 12: kill: Idx: arguments must be process or job IDs

Regular output is parsed correctly, but if for some reason the output contains one more comment line before the process list, the loop picks up a line that is not a PID.

    root@hpc-hostname:~# nvidia-smi pmon -c 1
    # gpu        pid  type    sm   mem   enc   dec   command
    # Idx          #   C/G     %     %     %     %   name
        0          -     -     -     -     -     -   -
        1          -     -     -     -     -     -   -
        2          -     -     -     -     -     -   -
        3          -     -     -     -     -     -   -
        4          -     -     -     -     -     -   -
        5          -     -     -     -     -     -   -
        6          -     -     -     -     -     -   -
        7          -     -     -     -     -     -   -
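To illustrate the failure mode: `tail -n+3` assumes exactly two header lines. With one extra line in front, the `# Idx` header survives, its second field is `Idx`, and `grep -v -` lets it through because it contains no dash. Below is a minimal sketch of a filter that does not depend on the header count. It is not necessarily what #1316 does; the function name `gpu_pids`, the extra comment line, and the sample PID `12345` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of deepops): print only numeric PIDs
# from `nvidia-smi pmon -c 1`-style output read on stdin.
gpu_pids() {
    # Skip lines whose first field starts with '#',
    # then keep column 2 only when it consists entirely of digits.
    awk '$1 !~ /^#/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Sample input: one extra leading comment line, an idle GPU,
# and one invented process entry (PID 12345).
sample='# extra comment line
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    1      12345     C     -     -     -     -   python'

printf '%s\n' "$sample" | gpu_pids
```

In the epilog loop this could be used as `for i in $(nvidia-smi pmon -c 1 | gpu_pids)`, so header text and `-` placeholders are never passed to `kill`, however many comment lines come before the process list.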

ilya-da, Sep 14 '24 14:09

#1316 proposed solution

ilya-da, Sep 14 '24 16:09

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.

github-actions[bot], Nov 14 '24 01:11