
Enable auto checkpointing on SIGTERM

Open ferdonline opened this issue 4 years ago • 3 comments

Motivation As a follow-up to #252, we want CoreNeuron to be able to create checkpoints right before an allocation expires. Since most job schedulers send SIGTERM before SIGKILL, we implement a handler for that signal. It may, however, be necessary to tune how long before SIGKILL the scheduler sends SIGTERM, since long simulations may need some time to write everything out.

Implementation Checkpoints are created in a folder named _corenrn_ckpt inside the output root, and only if a minimum amount of time has elapsed. On startup, if no --restore is provided, this directory is checked for existence.
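The mechanism described above could be sketched roughly as follows. This is not the code from this PR: all function and variable names (`g_sigterm_received`, `install_sigterm_handler`, `maybe_checkpoint`, `has_auto_checkpoint`) and the placeholder file `state.txt` are hypothetical, and the sketch assumes that the minimum time is measured from the start of the simulation. The handler only sets a flag; the actual write happens later from the simulation loop, because file I/O is not safe inside a signal handler.

```cpp
#include <chrono>
#include <csignal>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical flag, set asynchronously when SIGTERM arrives.
volatile std::sig_atomic_t g_sigterm_received = 0;

// The handler does nothing but record the signal (async-signal-safe).
extern "C" void on_sigterm(int) {
    g_sigterm_received = 1;
}

// Hypothetical helper: register the handler once at startup.
void install_sigterm_handler() {
    std::signal(SIGTERM, on_sigterm);
}

// Folder name taken from the description: <output root>/_corenrn_ckpt
fs::path checkpoint_dir(const fs::path& output_root) {
    return output_root / "_corenrn_ckpt";
}

// Startup check: does an automatic checkpoint folder already exist?
bool has_auto_checkpoint(const fs::path& output_root) {
    return fs::is_directory(checkpoint_dir(output_root));
}

// Called from the simulation loop. Writes a checkpoint only if SIGTERM
// was received AND at least `min_elapsed` has passed since `start`
// (assumed here to be the simulation start time).
// Returns true if a checkpoint was written.
bool maybe_checkpoint(const fs::path& output_root,
                      std::chrono::steady_clock::time_point start,
                      std::chrono::seconds min_elapsed) {
    if (!g_sigterm_received) {
        return false;
    }
    if (std::chrono::steady_clock::now() - start < min_elapsed) {
        return false;  // too early for a checkpoint to be worthwhile
    }
    fs::create_directories(checkpoint_dir(output_root));
    // Placeholder for the real state dump.
    std::ofstream(checkpoint_dir(output_root) / "state.txt") << "placeholder\n";
    return true;
}
```

In such a design the simulation loop would call `maybe_checkpoint` between time steps and stop once it returns true, so the time between SIGTERM and SIGKILL has to cover at least one step plus the write itself.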

ferdonline avatar May 10 '21 18:05 ferdonline

I don't really understand why the CI would fail on GPU. @pramodk Any ideas?

ferdonline avatar May 11 '21 16:05 ferdonline

> I don't really understand why the CI could fail in GPU. @pramodk Ideas?

Sorry for the delay. This issue is being investigated; you can ignore this error.

pramodk avatar May 17 '21 18:05 pramodk

Can you rebase this @ferdonline?

olupton avatar Jun 30 '21 08:06 olupton