benchmark
ddp experiments: run all measurements on the same allocation
Previously, we would launch a separate Slurm job for each measurement, e.g.:
- resnet50 w/ inductor + graph breaks, 2 nodes
- resnet50 w/ inductor + NO graph breaks, 2 nodes
- ...
But there's a lot of variation between different nodes, possibly due to network topology etc. These Slurm jobs would often each be submitted to a different set of nodes, which would add a lot of noise to the data.
So instead, with this PR, we do the following:
- allocate enough nodes to run all the measurements
- launch (8 * max_nodes) jobs, and provide a list of measurements to each of the jobs
- in each job, we iterate through the list of measurements; for each measurement, we spawn a new process that runs it.
- Once that process exits, we synchronize via a barrier implemented with torch.distributed's FileStore.
- Note that if we have, say, 8 nodes, and one measurement only requires 4 nodes, then the other 4 nodes won't run that measurement and will just sit idle waiting on the barrier.