
Multi-node checkpoint/restart benchmark on Frontera

Open · ChristopherHogan opened this issue 4 years ago · 1 comment

We want to model a checkpoint/restart workload.

  • 4 client nodes, with 48 ranks each (out of the 56 cores per node)
  • Treat ${SCRATCH} as the PFS, i.e., the final destination.
  • The tiers are /tmp, which is an NVMe SSD, and RAM.
  • Run Hermes as a daemon.
  • Run ior -w to simulate a checkpoint.
  • Run ior -r to simulate a restart.
  • For the baseline, the checkpoint phase exits once the data is flushed to the PFS, and the restart phase reads from the PFS (see the first sketch below).
  • Hermes stores the checkpoint in the hierarchy, so we should see faster writes and reads (see the second sketch below).
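As a concrete starting point, here is a minimal baseline sketch. The srun layout follows the node counts above; the IOR transfer (-t) and block (-b) sizes are illustrative placeholders, not tuned values:

```bash
#!/bin/bash
# Baseline: checkpoint/restart straight to the PFS, no Hermes.

CKPT=${SCRATCH}/ior_checkpoint

# Checkpoint phase: write-only (-w), keeping the files (-k) for the restart.
srun -N 4 --ntasks-per-node=48 ior -w -k -t 1m -b 256m -o "$CKPT"

# Restart phase: read-only (-r), back from the PFS.
srun -N 4 --ntasks-per-node=48 ior -r -t 1m -b 256m -o "$CKPT"
```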
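And a hedged sketch of the Hermes run. The daemon binary name, adapter library path, and config file are assumptions for illustration, not verified against the Hermes tree:

```bash
#!/bin/bash
# Hermes mode (sketch): one daemon per node; clients intercept POSIX I/O
# through an LD_PRELOAD adapter. The daemon binary name, adapter library
# path, and config file name below are hypothetical placeholders.

export HERMES_CONF=./hermes.conf  # tiers: RAM, /tmp (NVMe), ${SCRATCH} as the PFS

# One Hermes daemon per node (each daemon step uses 1 of the 56 cores).
srun -N 4 --ntasks-per-node=1 hermes_daemon &
sleep 5  # crude wait for the daemons to come up

CKPT=${SCRATCH}/ior_checkpoint

# Checkpoint: writes land in the hierarchy (RAM, then NVMe) instead of the PFS.
srun -N 4 --ntasks-per-node=48 --export=ALL,LD_PRELOAD=/path/to/libhermes_posix.so \
    ior -w -k -t 1m -b 256m -o "$CKPT"

# Restart: reads should be served from the upper tiers.
srun -N 4 --ntasks-per-node=48 --export=ALL,LD_PRELOAD=/path/to/libhermes_posix.so \
    ior -r -t 1m -b 256m -o "$CKPT"
```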

Blocked by ~~#181~~ and #266.

ChristopherHogan · Oct 29 '21 19:10

Hurdles

  • Slurm doesn't always allocate sequential hostnames. See #272 (and the allocation check sketched below).
  • The local SSDs are slower than the PFS, which effectively leaves us with a single usable tier (RAM). A crude bandwidth check is sketched below.
  • The idea of starting Hermes daemons and then reading and writing from different groups of nodes doesn't work, because currently the number of daemon nodes must match the number of client nodes, and each node must run exactly one daemon.
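Regarding the first hurdle, the allocation can be inspected from inside a job; scontrol expands the compact node list into one hostname per line:

```bash
# Expand the compact node list into individual hostnames;
# the output frequently shows non-contiguous nodes.
scontrol show hostnames "$SLURM_JOB_NODELIST"
```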
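And for the second hurdle, a crude single-stream comparison of the /tmp SSD against the PFS (dd with direct I/O; a sanity check only, not a benchmark):

```bash
# Write 1 GiB with direct I/O to each target and compare the reported rates.
dd if=/dev/zero of=/tmp/tier_test bs=1M count=1024 oflag=direct
dd if=/dev/zero of=${SCRATCH}/tier_test bs=1M count=1024 oflag=direct
rm -f /tmp/tier_test ${SCRATCH}/tier_test
```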

ChristopherHogan · Nov 05 '21 21:11