Multi-node checkpoint/restart benchmark on Frontera
We want to model a checkpoint/restart workload.
- 4 client nodes, with 48 ranks each (out of 56 cores per node)
- Treat `${SCRATCH}` as the PFS, i.e. the final destination.
- Tiers are `/tmp`, which is a local NVMe SSD, and RAM.
- Run Hermes as a daemon.
- Run `ior -w` to simulate a checkpoint.
- Run `ior -r` to simulate a restart.
- For the baseline, the checkpoint phase will exit once the data is flushed to the PFS, and the restart phase will read back from the PFS (a sketch of such a baseline run follows this list).
- Hermes will store the checkpoint in the hierarchy, so we expect faster writes and reads.
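As a rough outline, the baseline could look something like the batch sketch below. It assumes `ior` is on the `PATH`; the transfer size, block size, and wall time are placeholders, and the Hermes runs would add the daemon startup and buffering on top of this.

```bash
#!/bin/bash
#SBATCH -N 4                    # 4 client nodes
#SBATCH --ntasks-per-node=48    # 48 ranks per node (out of 56 cores)
#SBATCH -t 00:30:00             # placeholder wall time

# Baseline: checkpoint (write) and restart (read) go straight to the PFS.
CKPT=${SCRATCH}/ior_checkpoint

# Checkpoint phase: write only, fsync, keep the file for the restart phase.
srun ior -w -e -k -t 1m -b 256m -o "$CKPT"

# Restart phase: read the same checkpoint back from the PFS.
srun ior -r -t 1m -b 256m -o "$CKPT"
```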
Blocked by ~~#181~~ and #266.
Hurdles
- Slurm doesn't always allocate sequential hostnames. See #272.
- The local SSDs are slower than the PFS, which effectively only gives us a single tier (RAM).
- The idea of starting Hermes daemons and then reading and writing from different groups of nodes doesn't work: currently the number of daemon nodes must equal the number of client nodes, and each node must run exactly one daemon (see the launch sketch below).
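Given the one-daemon-per-node constraint and the non-sequential hostnames, a job would have to derive its host list from the allocation itself and co-locate a daemon with the clients on every node. A minimal sketch, assuming the daemon binary is called `hermes_daemon` (the actual binary name and its configuration flags may differ):

```bash
# Expand Slurm's compact node list (which is not necessarily sequential)
# into one hostname per line, e.g. for a Hermes host file.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hermes_hosts.txt

# Start exactly one Hermes daemon on every allocated node, in the background.
srun --ntasks-per-node=1 hermes_daemon &

# ... then run the IOR client ranks on the same nodes and shut the daemons down.
```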