ReFrame crashed when it queried a `REQUEUED` job with Slurm
Here is the stack trace:
[2021-05-19T17:07:34] debug2: reframe: [CMD] 'sacct -S 2021-05-19 -P -j 31334100,31334101,31334102,31334103,31334106,31334109,31334110,31334112,31334114,31334116,31334118,31334119,31334120,31334121,31334122,31334123,31334124,31334125,31334126,31334127,31334128,31334131,31334132,31334133,31334134,31334135,31334136,31334139,31334142,31334143,31334145,31334147,31334148,31334149,31334150,31334155,31334156,31334157,31334158,31334163,31334164,31334165,31334167,31334172,31334173,31334174,31334175,31334180,31334181,31334182,31334183,31334188,31334189,31334190,31334191,31334196,31334197,31334198,31334199,31334204,31334205,31334206,31334207,31334212,31334213,31334214,31334215 -o jobid,state,exitcode,end,nodelist'
[2021-05-19T17:07:34] debug2: reframe: [CMD] 'squeue -h -j 31334100 -o %r'
[2021-05-19T17:07:34] info: reframe: [ FAILED ] Ran 8/125 test case(s) from 51 check(s) (0 failure(s), 0 skipped)
[2021-05-19T17:07:34] info: reframe: [==========] Finished on Wed May 19 17:07:34 2021
[2021-05-19T17:07:34] error: reframe: /apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/bin/reframe: run session stopped: spawned process error: command 'squeue -h -j 31334100 -o %r' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
slurm_load_jobs error: Invalid job id specified
--- stderr ---
[2021-05-19T17:07:34] verbose: reframe: Traceback (most recent call last):
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/frontend/cli.py", line 998, in main
runner.runall(testcases, restored_cases)
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/frontend/executors/__init__.py", line 431, in runall
self._runall(testcases)
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/frontend/executors/__init__.py", line 504, in _runall
self._policy.exit()
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/frontend/executors/policies.py", line 532, in exit
self._poll_tasks()
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/frontend/executors/policies.py", line 446, in _poll_tasks
part.scheduler.poll(*part_jobs)
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/core/schedulers/slurm.py", line 434, in poll
self._cancel_if_blocked(job)
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/core/schedulers/slurm.py", line 467, in _cancel_if_blocked
completed = _run_strict('squeue -h -j %s -o %%r' % job.jobid)
File "/apps/daint/UES/jenkins/7.0.UP02/reframe/software/reframe/3.6.0/reframe/utility/osext.py", line 72, in run_command
completed.returncode)
reframe.core.exceptions.SpawnedProcessError: command 'squeue -h -j 31334100 -o %r' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
slurm_load_jobs error: Invalid job id specified
--- stderr ---
I don't think it should have crashed, but I'm not sure what the right behaviour would be either. It seems that the following happened: the job was requeued and was therefore pending from ReFrame's point of view, so ReFrame tried to get the reason why it was pending using squeue. Apparently, the job was no longer known to squeue, hence the hard error.
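Just to illustrate one possibility (a rough sketch built around the `_cancel_if_blocked()` call in the traceback; `run_command`, `SpawnedProcessError` and `job.jobid` are taken from there, but the exact signatures are assumptions on my part): if squeue no longer knows the job, the query could perhaps be treated as "no pending reason" and left to the next sacct poll, instead of raising:

```python
import reframe.utility.osext as osext
from reframe.core.exceptions import SpawnedProcessError


def get_pending_reason(job):
    '''Sketch: query the pending reason, tolerating an unknown job id.

    Returns None if squeue does not know the job any more (e.g. it was
    requeued and has already left the queue), so the caller can simply
    skip the "cancel if blocked" check for this polling round.
    '''
    try:
        completed = osext.run_command(
            f'squeue -h -j {job.jobid} -o %r', check=True
        )
    except SpawnedProcessError as err:
        if 'Invalid job id specified' in str(err):
            # Job is unknown to squeue; let the next sacct poll decide
            # what its real state is.
            return None

        raise

    return completed.stdout.strip() or None
```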
Not reproducible, closing.
Maybe we shouldn't close this one just yet. We had this issue a few weeks ago at CSCS. I am paraphrasing slightly from the Jira issue that Luca had opened back then:
ERROR: run session stopped: spawned process error: command 'squeue -h -j 2311469 -o %r' failed with exit code 1:
slurm_load_jobs error: Invalid job id specified
The jobid 2311469 that triggered the Slurm error was CANCELLED, then REQUEUED and finally COMPLETED; its nodelist was nid[001101-001116]. The corresponding check therefore did not fail, but the command used to monitor the Slurm queue reported the error above. Maybe we could adapt the way ReFrame currently monitors the jobs in the queue, for instance by increasing the waiting time or by taking job requeueing into account. Please note that the nodelist of jobid 2311469 included nodes that were affected by a DVS failure, so it is unlikely that this unusual sequence of job states (CANCELLED, then REQUEUED and finally COMPLETED) would occur on a healthy file system mounted by the compute nodes. The Slurm sbatch option `--no-requeue` is also a way to avoid requeueing jobs.
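For completeness, here is a minimal sketch of how a check could pass that flag through ReFrame (the check itself is hypothetical; it assumes the usual `self.job.options` list and a `run_before('run')` hook):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NoRequeueCheck(rfm.RunOnlyRegressionTest):
    '''Hypothetical check whose job must not be requeued by Slurm.'''

    def __init__(self):
        self.valid_systems = ['*']
        self.valid_prog_environs = ['*']
        self.executable = 'hostname'
        self.sanity_patterns = sn.assert_found(r'\S+', self.stdout)

    @rfm.run_before('run')
    def disable_requeueing(self):
        # Passed verbatim to sbatch, so the job fails instead of being
        # requeued, e.g. after a node failure.
        self.job.options += ['--no-requeue']
```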
Yesterday I had a problem where the polling of a submitted job stopped because squeue failed with a transient error:
ERROR: run session stopped: spawned process error: command 'squeue -h -j 32564066 -o %r' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
slurm_load_jobs error: Socket timed out on send/recv operation
--- stderr ---
In the past I had related problems on a different cluster using the SGE scheduler, where the polling command may fail if the system is temporarily overloaded, even though the job is still pending/running.
Strictly speaking, this isn't ReFrame's fault, but the net effect is that if some transient issue on the system causes the polling command to fail, ReFrame stops watching the job and all the postprocessing (gathering sanity checks and/or performance metrics) is bailed out, which is rather frustrating. I think ReFrame should be more resilient in these cases, maybe by retrying the poll a few more times if the command fails in an unexpected way?
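For what it's worth, here is a rough sketch of the kind of retry I have in mind (generic subprocess code, not ReFrame's actual polling implementation; the retry counts and the list of "transient" messages are just guesses):

```python
import subprocess
import time

# Hypothetical list of error messages we would consider transient.
TRANSIENT_ERRORS = (
    'Socket timed out on send/recv operation',
    'Unable to contact slurm controller',
)


def poll_with_retries(cmd, max_retries=3, wait=10):
    '''Run a scheduler polling command, retrying on transient failures.

    Only after all retries are exhausted is the error propagated, so a
    momentary squeue/sacct hiccup does not abort the whole session.
    '''
    for attempt in range(max_retries + 1):
        try:
            return subprocess.run(cmd, shell=True, check=True,
                                  capture_output=True, text=True)
        except subprocess.CalledProcessError as err:
            transient = any(msg in (err.stderr or '')
                            for msg in TRANSIENT_ERRORS)
            if not transient or attempt == max_retries:
                raise

            time.sleep(wait)


# Example usage:
# poll_with_retries('squeue -h -j 32564066 -o %r')
```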