Multi-node checkpoint/restart benchmark on Frontera
We want to model a checkpoint/restart workload.
- 4 client nodes, with 48 ranks each (out of 56 cores per node)
- Treat `${SCRATCH}` as the PFS, i.e. the final destination.
- Tiers are `/tmp`, which is a local NVMe SSD, and RAM.
- Run Hermes as a daemon.
- Run `ior -w` to simulate a checkpoint.
- Run `ior -r` to simulate a restart.
- For the baseline, the checkpoint phase will exit once the data is flushed to the PFS, and the restart phase will read back from the PFS (a sketch of such a baseline run follows this list).
- Hermes will store the checkpoint in the hierarchy, so we expect faster writes and reads.
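As a rough outline, the baseline could look something like the batch sketch below. It assumes `ior` is on the `PATH`; the transfer size, block size, and wall time are placeholders, and the Hermes runs would add the daemon startup and buffering on top of this.

```bash
#!/bin/bash
#SBATCH -N 4                    # 4 client nodes
#SBATCH --ntasks-per-node=48    # 48 ranks per node (out of 56 cores)
#SBATCH -t 00:30:00             # placeholder wall time

# Baseline: checkpoint (write) and restart (read) go straight to the PFS.
CKPT=${SCRATCH}/ior_checkpoint

# Checkpoint phase: write only, fsync, keep the file for the restart phase.
srun ior -w -e -k -t 1m -b 256m -o "$CKPT"

# Restart phase: read the same checkpoint back from the PFS.
srun ior -r -t 1m -b 256m -o "$CKPT"
```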
Blocked by ~~#181~~ and #266.
Hurdles
- Slurm doesn't always allocate sequential hostnames. See #272.
- The local SSDs are slower than the PFS, which effectively only gives us a single tier (RAM).
- The idea of starting Hermes daemons and then reading and writing from different groups of nodes doesn't work: currently the number of daemon nodes must equal the number of client nodes, and each node must run exactly one daemon (see the launch sketch below).
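Given the one-daemon-per-node constraint and the non-sequential hostnames, a job would have to derive its host list from the allocation itself and co-locate a daemon with the clients on every node. A minimal sketch, assuming the daemon binary is called `hermes_daemon` (the actual binary name and its configuration flags may differ):

```bash
# Expand Slurm's compact node list (which is not necessarily sequential)
# into one hostname per line, e.g. for a Hermes host file.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hermes_hosts.txt

# Start exactly one Hermes daemon on every allocated node, in the background.
srun --ntasks-per-node=1 hermes_daemon &

# ... then run the IOR client ranks on the same nodes and shut the daemons down.
```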