
Yet another `BatchtoolsExpiration: Future ('<none>') expired` error

Open kpj opened this issue 3 years ago • 0 comments

Executing the following snippet on our LSF-powered cluster fails with the exception `Error: BatchtoolsExpiration: Future ('<none>') expired (registry path [..]).`

library(future)
library(future.batchtools)
library(furrr)

plan(
  batchtools_lsf,
  template = "lsf-simple.tmpl",
  resources = list(queue = "gpu.4h", walltime = 60 * 60 * 4, memory = "5000", core_num = 2)
)
future_map_dfr(1:10, function(x) { data.frame(x = x, y = x^2) })
plan(sequential)

The `lsf-simple.tmpl` template used is:

#BSUB -J <%= job.name %>
#BSUB -o <%= log.file %>
#BSUB -q <%= resources$queue %>
#BSUB -W <%= round(resources$walltime / 60, 1) %>    # resources$walltime in seconds
#BSUB -M <%= resources$memory %>
#BSUB -R "rusage[mem=<%= resources$memory %>, ngpus_excl_p=1]"
#BSUB -n <%= resources$core_num %>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
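
To make the resource mapping concrete, here is a minimal base-R sketch of what the header directives should expand to for the `resources` list passed to `plan()` above. It is an illustration only: `paste()` and `sprintf()` stand in for the actual template rendering, and the variable names `walltime_minutes` and `directives` are made up for this example.

```r
# Resources exactly as passed to plan() in the snippet above
resources <- list(queue = "gpu.4h", walltime = 60 * 60 * 4, memory = "5000", core_num = 2)

# Same expression as the template's -W line: seconds -> minutes
walltime_minutes <- round(resources$walltime / 60, 1)

# Illustration only: paste()/sprintf() stand in for the template engine
directives <- c(
  paste("#BSUB -q", resources$queue),
  paste("#BSUB -W", format(walltime_minutes)),
  paste("#BSUB -M", resources$memory),
  sprintf('#BSUB -R "rusage[mem=%s, ngpus_excl_p=1]"', resources$memory),
  paste("#BSUB -n", resources$core_num)
)
writeLines(directives)
```

With `walltime = 60 * 60 * 4` seconds, the `-W` value comes out as 240 minutes, which matches the 4 h limit suggested by the queue name.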

This topic has already been discussed in various settings:

  • https://github.com/mllg/batchtools/issues/240: too many jobs at the same time
  • https://github.com/HenrikBengtsson/future.batchtools/issues/48: busy device
  • https://github.com/HenrikBengtsson/future.batchtools/issues/31: killed by job system

Possible solutions have been suggested for SLURM (https://github.com/HenrikBengtsson/future.batchtools/issues/74, https://github.com/mllg/batchtools/issues/273).

I am currently trying to resolve this for LSF, and I do not think any of the three causes listed above applies: 1) I am spawning only a very small number of jobs, 2) no error message is reported, and 3) LSF reports the status DONE for the jobs that expired:

[R script]
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path [..]).. The last few lines of the logged output:
Sender: LSF System <[..]>
Subject: Job 212072182: <jobc6330b006f2ac3311db1511449421e23> in cluster <[..]> Done
Job <jobc6330b006f2ac3311db1511449421e23> was submitted from host <[..]> by user <[..]> in cluster <[..]> at Fri Apr  1 14:24:57 2022
[..]
$ bjobs 212072182
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
212072182  [..]    DONE  gpu.4h     [..]        [..]        *449421e23 Apr  1 14:24
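
One way to dig further would be to load the registry named in the error message and ask batchtools which jobs it considers expired. The following is an untested sketch: `inspect_expired` and its `registry_path` argument are hypothetical names, the path has to be taken from the actual error message, and it assumes the registry folder still exists on disk.

```r
# Hypothetical helper: summarise a batchtools registry and print the logs
# of every job that batchtools classifies as expired.
inspect_expired <- function(registry_path) {
  # Load read-only and without replacing the session's default registry
  reg <- batchtools::loadRegistry(registry_path, writeable = FALSE,
                                  make.default = FALSE)

  # Overall counts (submitted, started, done, error, expired, ...)
  print(batchtools::getStatus(reg = reg))

  # Jobs that were submitted but are neither finished nor known to the scheduler
  expired <- batchtools::findExpired(reg = reg)

  for (id in expired$job.id) {
    cat("---- log for job", id, "----\n")
    # A log file may be missing, so do not let that abort the loop
    tryCatch(
      writeLines(batchtools::getLog(id, reg = reg)),
      error = function(e) cat("no log available:", conditionMessage(e), "\n")
    )
  }
  invisible(expired)
}
```

It would be called as `inspect_expired("<registry path from the error message>")`. If the job shows up as expired here while `bjobs` reports DONE, that would suggest the result was never written back to the registry rather than that the job was killed.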

I'd be excited to hear your thoughts on this. Did I do something wrong? Or is there a way of fixing this?

kpj, Apr 01 '22 12:04