DeepSpeed Fix comms benchmark import issues and support MPI/slurm launching

Currently users need do either launch with ds_bench, deepspeed or append the DeepSpeed repo to their PYTHONPATH to avoid an import error when launching communication benchmarks.

These changes both fix the import errors by automatically appending the python sys path before import, and enable automatic MPI/slurm environment discovery similar to how we detect it in [deepspeed/comm.py](https://github.com/microsoft/DeepSpeed/blob/4ae3a3da0dfd19d7ab7a76e7c742ac12f44fc1c0/deepspeed/comm/comm.py#L664). This also enables MPI launching for both the pytorch distributed and deepspeed comm backends.

Tested on up to 2k GPUs.

I also updated the README to be more specific and comprehensive.

@jeffra @awan-10

Mar 03 '23 04:03 Quentin-Anthony

These test failures look unrelated to the PR

Mar 04 '23 15:03 Quentin-Anthony

Looks like some conflicts are showing up with the changes in https://github.com/microsoft/DeepSpeed/commit/0acf7e9c489def74d3ac05ee91eabfb8280a1dba. Specifically the top-level deepspeed imports introduced.

The reason for runtime-checks like https://github.com/microsoft/DeepSpeed/blob/3879a503ae601deb7ad9a3758d488cc91889a08f/benchmarks/communication/all_gather.py#L40 and the lack of top-level deepspeed imports is to enable users to benchmark communication ops using pure torch.distributed even without deepspeed installed (e.g. users of NeMo, Megatron-LM, fairseq, ColossalAI, etc).

I'm of the opinion that we should preserve this behavior for the comms benchmarks (i.e. no deepspeed.* top-level imports and wrap all deepspeed- or torch.distributed-specific code in an if args.dist == 'torch/deepspeed' check), but I think someone from the deepspeed team like @tjruwase and/or @jeffra have the final say. Can one of you provide comment?

Mar 13 '23 22:03 Quentin-Anthony

Conflicts are resolved, and CI failures look unrelated @jeffra

Mar 24 '23 19:03 Quentin-Anthony

I'm not sure why formatting CI tests are failing now. I ran pre-commit checks before committing, and I can't seem to reproduce these failures on my end.

Mar 26 '23 16:03 Quentin-Anthony

@Quentin-Anthony, check your pre-commit and clang-format versions. Upgrading worked for me.

Mar 27 '23 01:03 tjruwase

Ok all checks are finally passing!

Mar 27 '23 20:03 Quentin-Anthony