Yet another `BatchtoolsExpiration: Future ('<none>') expired` error
Executing the following snippet on our LSF-powered cluster fails with the error `BatchtoolsExpiration: Future ('<none>') expired (registry path [..])`:
library(future)
library("future.batchtools")
library(furrr)
plan(batchtools_lsf, template = "lsf-simple.tmpl", resources = list(queue = "gpu.4h", walltime = 60 * 60 * 4, memory = "5000", core_num = 2))
future_map_dfr(1:10, function(x) { data.frame(x = x, y = x^2) })
plan(sequential)
The template `lsf-simple.tmpl` looks as follows:
#BSUB -J <%= job.name %>
#BSUB -o <%= log.file %>
#BSUB -q <%= resources$queue %>
#BSUB -W <%= round(resources$walltime / 60, 1) %> # resources$walltime in seconds
#BSUB -M <%= resources$memory %>
#BSUB -R "rusage[mem=<%= resources$memory %>, ngpus_excl_p=1]"
#BSUB -n <%= resources$core_num %>
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
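As a sanity check on the requested resources, here is a minimal sketch of what the `#BSUB` header should expand to for the `resources` list passed to `plan()`. It assumes the template engine simply substitutes the printed value of each `<%= ... %>` expression and mimics that with `sprintf()`. The name `bsub_lines` is introduced here for illustration only, and the `-J` and `-o` lines are left out because their values are generated by batchtools at submission time.

```r
# Resources exactly as passed to plan() in the snippet above
resources <- list(queue = "gpu.4h", walltime = 60 * 60 * 4,
                  memory = "5000", core_num = 2)

# The same R expressions the template evaluates inside <%= ... %>
bsub_lines <- c(
  sprintf("#BSUB -q %s", resources$queue),
  sprintf("#BSUB -W %s", round(resources$walltime / 60, 1)),  # seconds -> minutes
  sprintf("#BSUB -M %s", resources$memory),
  sprintf("#BSUB -R \"rusage[mem=%s, ngpus_excl_p=1]\"", resources$memory),
  sprintf("#BSUB -n %s", resources$core_num)
)

writeLines(bsub_lines)
```

With these values the wall-time limit comes out as 240 minutes, which matches the 4 hour limit of the `gpu.4h` queue, so the requested run time should not be what terminates the jobs.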
This error has already been discussed in several contexts:
- https://github.com/mllg/batchtools/issues/240: too many jobs at the same time
- https://github.com/HenrikBengtsson/future.batchtools/issues/48: busy device
- https://github.com/HenrikBengtsson/future.batchtools/issues/31: killed by job system
Possible solutions have been suggested for SLURM (https://github.com/HenrikBengtsson/future.batchtools/issues/74, https://github.com/mllg/batchtools/issues/273).
I am trying to resolve this error on LSF, and I do not think any of the three causes listed above applies here: 1) I am submitting only a small number of jobs (10), 2) no error message is reported in the job log, and 3) LSF reports the status DONE for the jobs whose futures expired:
[R script]
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path [..]).. The last few lines of the logged output:
Sender: LSF System <[..]>
Subject: Job 212072182: <jobc6330b006f2ac3311db1511449421e23> in cluster <[..]> Done
Job <jobc6330b006f2ac3311db1511449421e23> was submitted from host <[..]> by user <[..]> in cluster <[..]> at Fri Apr 1 14:24:57 2022
[..]
$ bjobs 212072182
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
212072182 [..] DONE gpu.4h [..] [..] *449421e23 Apr 1 14:24
I'd be excited to hear your thoughts on this. Did I do something wrong? Or is there a way of fixing this?