`runModule_start_model_runs()` calls `remote.copy.from()` immediately if job id is `NULL`

Open Aariq opened this issue 3 years ago • 3 comments

Bug Description

`runModule_start_model_runs()` should, I think, copy the run directory over to the HPC, launch the jobs there, wait for them to finish (with `qsub_run_finished()`), and then copy the out directory back (with `remote.copy.from()`). This isn't working for me because the job ID is `NULL`, so the copy back happens immediately instead of after the runs finish. Here's a section of the output from `runModule_start_model_runs()`:

2022-07-12 21:42:26 DEBUG  [remote.execute.cmd] : 
   ssh -T -l ericrscott puma 'squeue --job NULL &> /dev/null || echo DONE' 
2022-07-12 21:42:27 DEBUG  [PEcAn.remote::qsub_run_finished] : 
   Job NULL for run NULL finished 
2022-07-12 21:42:27 DEBUG  [PEcAn.remote::remote.copy.from] : 
   rsync '-az' '-q' 
   'ericrscott@puma:/groups/dlebauer/ed2_results/pecan_remote/2022-07-12-21-39-43/out/ENS-00005-678' 
   '/home/ericrscott/Eric-ED2/WLEF/outputs/out' 

I'm not entirely sure what the fix is.

To Reproduce

I think I'd need some guidance on how to reproduce this.

Expected behavior

I'd expect the R session to wait until the HPC runs were finished and the results were copied back over. Additionally, it should fail fast if the job ID is not valid (e.g. if it's `NULL`).

Machine (please complete the following information):

  • Server [welsch.cyverse.org]

I think @dlebauer and @KristinaRiemer are aware of this bug as well, but couldn't find an open issue.

Aariq, Jul 12 '22 21:07

OK, so I think the problem on my end is an incorrect `<qsub.jobid>`, but I still think this should fail fast if the job ID is `NULL`.

Aariq, Jul 13 '22 19:07

So to clarify, here's what I think should happen if the job ID is `NULL` for any reason (e.g. an incorrect pattern in `<qsub.jobid>`):

  1. Print a message like "Job ID is NULL. Jobs are running but won't be automatically retrieved from host. Check host$qsub.jobid in settings".
  2. remote.copy.from() does NOT get called and the function exits.
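
The two steps above could be sketched roughly as follows. This is only an illustration of the proposed behavior, not the actual PEcAn code: `is_valid_jobid()` and `retrieve_if_tracked()` are hypothetical names, and `copy_back` is a placeholder function standing in for the real call to `remote.copy.from()`. Treating the literal string `"NULL"` as invalid is an assumption based on the `squeue --job NULL` line in the log.

```r
# Hypothetical helper (not part of PEcAn): TRUE only for a usable job ID.
is_valid_jobid <- function(jobid) {
  !is.null(jobid) &&
    length(jobid) == 1 &&
    !is.na(jobid) &&
    nzchar(as.character(jobid)) &&
    # Assumption: a literal "NULL" string is also treated as missing.
    !identical(as.character(jobid), "NULL")
}

# Hypothetical wrapper: `copy_back` stands in for the remote.copy.from() call.
retrieve_if_tracked <- function(jobid, copy_back) {
  if (!is_valid_jobid(jobid)) {
    # Step 1: tell the user the jobs cannot be tracked.
    warning(
      "Job ID is NULL. Jobs are running but won't be automatically ",
      "retrieved from host. Check host$qsub.jobid in settings.",
      call. = FALSE
    )
    # Step 2: exit without copying anything back.
    return(invisible(FALSE))
  }
  copy_back()
  invisible(TRUE)
}
```

With this shape, `retrieve_if_tracked(NULL, copy_back)` warns and returns without calling `copy_back`, while a job ID such as `"12345"` proceeds to the copy step. Whether this should be a warning or a hard error (`stop()`) is a design choice for the maintainers.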

Aariq, Jul 13 '22 20:07

This issue is stale because it has been open 365 days with no activity.

github-actions[bot], Jul 14 '23 00:07