CESM_postprocessing

batch jobs do not abort when error occurs

Open lvankampenhout opened this issue 7 years ago • 3 comments

Problem: whenever an error occurs somewhere down in the Python code, the batch job hangs and does not abort. When I log in to the compute node, I see 100% CPU usage. I am not sure whether this is specific to my local cluster (I ported the scripts to the SLURM cluster Cartesius) or comes from the postprocessing scripts themselves. However, it is clearly sub-optimal because the jobs have to be aborted manually.

lvankampenhout avatar Nov 06 '18 16:11 lvankampenhout

@lvankampenhout - is there a particular postprocessing task where the job doesn't abort correctly?

bertinia avatar Nov 07 '18 16:11 bertinia

Hi Alice, I encountered this issue with both the lnd_averages and timeseries tasks.

lvankampenhout avatar Nov 08 '18 07:11 lvankampenhout

Strangely enough, my jobs do abort today.

lvankampenhout avatar Nov 09 '18 10:11 lvankampenhout
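
For reference, the symptom reported above (a Python error followed by a job that keeps spinning at 100% CPU) is one that MPI-parallel Python programs can show when a single rank dies from an uncaught exception while the other ranks keep waiting in a collective call. The thread does not confirm that this is the cause here, so the following is only a minimal sketch under that assumption. It assumes the tasks run under mpi4py; `make_abort_hook`, `default_abort`, and `install_abort_hook` are hypothetical helper names and not part of CESM_postprocessing.

```python
import os
import sys
import traceback


def make_abort_hook(abort):
    """Return an excepthook that prints the traceback, then calls abort(1)."""
    def hook(exc_type, exc_value, exc_tb):
        # Report the error first so it ends up in the batch job log.
        traceback.print_exception(exc_type, exc_value, exc_tb)
        sys.stdout.flush()
        sys.stderr.flush()
        abort(1)
    return hook


def default_abort(code):
    """Abort every MPI rank if mpi4py is importable, else exit this process.

    Assumption: the job was launched through MPI and uses mpi4py.
    """
    try:
        from mpi4py import MPI
    except ImportError:
        os._exit(code)
    # Comm.Abort asks the MPI runtime to terminate all ranks, not just this one.
    MPI.COMM_WORLD.Abort(code)


def install_abort_hook(abort=default_abort):
    """Route uncaught exceptions in the main thread through the abort hook."""
    sys.excepthook = make_abort_hook(abort)
```

Calling `install_abort_hook()` once near the top of a task's entry point would, under the assumptions above, turn an uncaught exception on any rank into an abort of the whole job instead of leaving the remaining ranks waiting.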