benchmark
ddp experiments: run all measurements on the same allocation
Previously, we would launch a separate Slurm job for each measurement, e.g.:
- resnet50 w/ inductor + graph breaks, 2 nodes
- resnet50 w/ inductor + NO graph breaks, 2 nodes
- ...
But there's a lot of variation between different nodes, possibly due to network topology etc. These Slurm jobs would often each be submitted to a different set of nodes, which would add a lot of noise to the data.
So instead, with this PR, we do the following:
- allocate enough nodes to run all the measurements
- launch (8 * max_nodes) jobs, and provide a list of measurements to each of the jobs
- in each job, we iterate through the list of measurements; for each measurement, we spawn a new process that runs it.
- Once that process exits, we synchronize via a barrier implemented with torch.distributed's FileStore.
- Note that if we have, say, 8 nodes, and one measurement only requires 4 nodes, then the other 4 nodes won't run that measurement and will just sit idle waiting on the barrier.