Checkpoint support
We have no checkpointing support, but it sure would be good to have.
Nothing concrete here on this issue: open to collecting comments regarding use cases, design ideas, etc.
I think it would be nice to have the option to keep memory usage low by loading only the data from a defined number of timesteps back in the past. In other words, one would be able to:
- create a checkpoint on permanent storage containing the state of the simulation at a certain biological/simulated time t_1,
- load the stored state of the simulation while disregarding any timesteps before t_0, where t_0 < t_1.
This way, it would become possible to run very long simulations in pieces while still being able to perform many recordings. The timespan t_1 - t_0 should be adjustable to account for specifics of the model being used (e.g., for synaptic delay). This could possibly be related to issue #1232.
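To make this concrete, here is a minimal sketch of what such an interface could look like from Python. Nothing below exists in arbor today: `checkpoint`, `restore`, and the `history` parameter are invented for illustration, and the usual recipe/context/decomposition setup is assumed.

```python
import arbor

sim = arbor.simulation(recipe, context, decomp)  # usual setup, elided here
sim.run(1000, 0.025)                             # advance to t_1 = 1000 ms

# Imagined call: persist the full simulation state at t_1 on permanent storage.
sim.checkpoint("state_t1000.chk")

# Imagined call: resume, keeping only history newer than t_0; here
# t_1 - t_0 = 50 ms, chosen to cover the longest synaptic delay in the model.
sim = arbor.restore("state_t1000.chk", history=50)
sim.run(2000, 0.025)
```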
Another useful application would be removing transients from computationally heavy simulations. For example, the first 10 seconds of the IO network simulations, taking ~5-20 minutes of real time depending on inputs, are just the transient we need to discard before taking actual measurements. If we could snapshot the full network state after this 10 s transient and resume measurements from there, each time with different stimuli for example, this would greatly reduce (wasted) simulation time.
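With the same imagined calls from the sketch above, this workflow reduces to paying for the transient once and branching off measurement runs, each with its own stimuli:

```python
sim.run(10_000, 0.025)                # burn the ~10 s transient exactly once
sim.checkpoint("post_transient.chk")  # imagined call, as above

for stim in stimuli:                  # one branch per stimulus set
    branch = arbor.restore("post_transient.chk")
    # attach this branch's stimuli and probes here, then measure
    branch.run(12_000, 0.025)
```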
This is a very slow-moving project, sorry for that. Let's move it a bit forward so we can hammer out an implementation.
- What state precisely would you need to dump/restore? (One possible shape is sketched after this list.)
  - Network connections
  - Current spikes in-flight
  - Current spike counts
  - Mechanism internals
  - ...?
- What are the time/performance constraints?
  - How often do we expect to call this?
  - Is it ok if it takes 1s/10s/100s/...?
- What is the disk format?
  - JSON?
    - text, human-readable
    - no concurrent access
  - HDF5?
    - binary; needs external tools to inspect
    - parallel access
    - heavy dependency
    - potentially annoying to get right
  - Others?
    - Zarr
    - ...
- One file per
  - job: requires parallel I/O
  - rank: doesn't scale under restart w/ a different topology
  - gid: many, many files, kills PFS
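For the first question, one possible shape of the dumped state, written out as a Python sketch; all field names and layouts here are assumptions, not arbor internals:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Checkpoint:
    time: float                                 # biological time of the snapshot
    # one entry per connection: (source gid, target gid, weight, delay)
    connections: list[tuple[int, int, float, float]]
    # spikes emitted but not yet delivered: (source gid, delivery time)
    in_flight: list[tuple[int, float]]
    spike_counts: dict[int, int]                # per-gid counts, if kept at all
    mechanism_state: dict[str, np.ndarray]      # state variables per mechanism
```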
My answers (which may not be representative of all neuroscientists):
What state precisely would you need to dump/restore?
Enough information to continue the simulation as if it had never been checkpointed (thus including in-flight spikes, mechanism state, and connectivity), but discard things like recordings, spike counts, and other information that is extracted from the simulation but does not influence it (of course you could dump those explicitly too, but then it's your own problem). If I want to save those, I can do that manually.
What are the time/performance constraints?
Not much. I would not call this often, so it's fine if it takes a while (on the scale of multiple minutes at most). It would be nice if the simulation could be started quickly, but that's not too important.
It would start to matter if running from a checkpoint with ~10 different small parameter variations every n seconds became a thing (e.g., some kind of Kalman filtering). But then you'd also want in-memory checkpoints for speed, as opposed to disk checkpoints... maybe a different story.
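In that scenario, in-memory checkpointing could be as simple as deep-copying an extracted state object instead of touching the disk. Everything here is hypothetical: `get_state`, `set_state`, `apply_parameters`, `t_measure`, and `dt` are made-up names for illustration.

```python
import copy

base = sim.get_state()                 # hypothetical: pull full state into memory

for params in parameter_variations:
    sim.set_state(copy.deepcopy(base)) # hypothetical: rewind without disk I/O
    apply_parameters(sim, params)      # placeholder for the variation under test
    sim.run(t_measure, dt)             # short measurement run per variation
```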
What is the disk format?
Doesn't matter, I don't need to inspect it (although that would allow for pretty visualizations). Probably some kind of binary format; maybe protocol buffers/flatbuffers or similar.
One file per ...
My model is a single cell group, so this does not apply. But this is indeed pretty difficult to implement in a multi-(core/node/GPU) system. Restarting with a different topology is not something I'd do. A stop-the-world method, where all threads/ranks/processes send their data to a single thread that builds the checkpoint file, sounds good to me. In theory the checkpoint format should not be tied to the arbor context - a saved GPU simulation should later be able to run on a CPU - but this may be hard to implement in practice.
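A stop-the-world gather along those lines could look like this with mpi4py, where `pack_local_state` is a placeholder for whatever per-rank serialization arbor would expose:

```python
from mpi4py import MPI
import pickle

comm = MPI.COMM_WORLD

local = pack_local_state()           # placeholder: this rank's share of the state
pieces = comm.gather(local, root=0)  # every rank sends its piece to rank 0

if comm.Get_rank() == 0:
    with open("checkpoint.bin", "wb") as f:
        pickle.dump(pieces, f)       # one file, written by a single rank
```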
For our purposes (network simulations with a few thousand neurons), I can basically agree with everything that @llandsmeer has mentioned.
Regarding the point
What are the time/performance constraints?
we would just need to load from a checkpoint twice in a simulation of 1 h or so.
Thousands of cells implies the need for a (potentially compressed) binary format? Or would ASCII be fine? A quick guesstimate for a medium-sized network:

O(1000) cells x O(1000) CVs x O(10) arrays to write x O(10) bytes/value = O(100) MB
It would be nice to be able to check the checkpoint files into version control or something for 'reproducible research', so 100 MB does sound large (which is also the guesstimated order-of-magnitude size my networks would end up at). So definitely compression (just lzma on binary/ASCII, whatever works), but if that's not possible, so be it.
But compressed/binary files in VC are also an issue of their own.
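For what it's worth, the estimate above and the lzma suggestion are easy to sanity-check; the zero array below is just a stand-in, and real state data will compress far less well:

```python
import lzma
import numpy as np

cells, cvs, arrays = 1_000, 1_000, 10
print(cells * cvs * arrays * 8 / 1e6, "MB raw")   # 80.0 MB at 8 bytes/value

state = np.zeros(cells * cvs * arrays)            # stand-in for real state data
blob = lzma.compress(state.tobytes())
print(len(blob) / 1e6, "MB after lzma")           # tiny here; zeros compress well
```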