Fix comms benchmark import issues and support MPI/slurm launching
Currently users need do either launch with ds_bench, deepspeed or append the DeepSpeed repo to their PYTHONPATH to avoid an import error when launching communication benchmarks.
These changes both fix the import errors by automatically appending the python sys path before import, and enable automatic MPI/slurm environment discovery similar to how we detect it in [deepspeed/comm.py](https://github.com/microsoft/DeepSpeed/blob/4ae3a3da0dfd19d7ab7a76e7c742ac12f44fc1c0/deepspeed/comm/comm.py#L664). This also enables MPI launching for both the pytorch distributed and deepspeed comm backends.
Tested on up to 2k GPUs.
I also updated the README to be more specific and comprehensive.
@jeffra @awan-10
These test failures look unrelated to the PR
Looks like some conflicts are showing up with the changes in https://github.com/microsoft/DeepSpeed/commit/0acf7e9c489def74d3ac05ee91eabfb8280a1dba. Specifically the top-level deepspeed imports introduced.
The reason for runtime-checks like https://github.com/microsoft/DeepSpeed/blob/3879a503ae601deb7ad9a3758d488cc91889a08f/benchmarks/communication/all_gather.py#L40 and the lack of top-level deepspeed imports is to enable users to benchmark communication ops using pure torch.distributed even without deepspeed installed (e.g. users of NeMo, Megatron-LM, fairseq, ColossalAI, etc).
I'm of the opinion that we should preserve this behavior for the comms benchmarks (i.e. no deepspeed.* top-level imports and wrap all deepspeed- or torch.distributed-specific code in an if args.dist == 'torch/deepspeed' check), but I think someone from the deepspeed team like @tjruwase and/or @jeffra have the final say. Can one of you provide comment?
Conflicts are resolved, and CI failures look unrelated @jeffra
I'm not sure why formatting CI tests are failing now. I ran pre-commit checks before committing, and I can't seem to reproduce these failures on my end.
@Quentin-Anthony, check your pre-commit and clang-format versions. Upgrading worked for me.
Ok all checks are finally passing!