DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

Fix comms benchmark import issues and support MPI/slurm launching

Open Quentin-Anthony opened this issue 2 years ago • 2 comments

Currently users need do either launch with ds_bench, deepspeed or append the DeepSpeed repo to their PYTHONPATH to avoid an import error when launching communication benchmarks.

These changes both fix the import errors by automatically appending the python sys path before import, and enable automatic MPI/slurm environment discovery similar to how we detect it in [deepspeed/comm.py](https://github.com/microsoft/DeepSpeed/blob/4ae3a3da0dfd19d7ab7a76e7c742ac12f44fc1c0/deepspeed/comm/comm.py#L664). This also enables MPI launching for both the pytorch distributed and deepspeed comm backends.

Tested on up to 2k GPUs.

I also updated the README to be more specific and comprehensive.

@jeffra @awan-10

Quentin-Anthony avatar Mar 03 '23 04:03 Quentin-Anthony

These test failures look unrelated to the PR

Quentin-Anthony avatar Mar 04 '23 15:03 Quentin-Anthony

Looks like some conflicts are showing up with the changes in https://github.com/microsoft/DeepSpeed/commit/0acf7e9c489def74d3ac05ee91eabfb8280a1dba. Specifically the top-level deepspeed imports introduced.

The reason for runtime-checks like https://github.com/microsoft/DeepSpeed/blob/3879a503ae601deb7ad9a3758d488cc91889a08f/benchmarks/communication/all_gather.py#L40 and the lack of top-level deepspeed imports is to enable users to benchmark communication ops using pure torch.distributed even without deepspeed installed (e.g. users of NeMo, Megatron-LM, fairseq, ColossalAI, etc).

I'm of the opinion that we should preserve this behavior for the comms benchmarks (i.e. no deepspeed.* top-level imports and wrap all deepspeed- or torch.distributed-specific code in an if args.dist == 'torch/deepspeed' check), but I think someone from the deepspeed team like @tjruwase and/or @jeffra have the final say. Can one of you provide comment?

Quentin-Anthony avatar Mar 13 '23 22:03 Quentin-Anthony

Conflicts are resolved, and CI failures look unrelated @jeffra

Quentin-Anthony avatar Mar 24 '23 19:03 Quentin-Anthony

I'm not sure why formatting CI tests are failing now. I ran pre-commit checks before committing, and I can't seem to reproduce these failures on my end.

Quentin-Anthony avatar Mar 26 '23 16:03 Quentin-Anthony

@Quentin-Anthony, check your pre-commit and clang-format versions. Upgrading worked for me.

tjruwase avatar Mar 27 '23 01:03 tjruwase

Ok all checks are finally passing!

Quentin-Anthony avatar Mar 27 '23 20:03 Quentin-Anthony