[BUG] (launched with accelerate) Stage 3 backward RuntimeError: ProcessGroup nccl does not support _reduce_scatter_base
Describe the bug
I don't know why ZeRO Stage 3 triggers this error during the backward pass: RuntimeError: ProcessGroup nccl does not support _reduce_scatter_base. Training with Stage 2 works fine. Could this be related to my PyTorch version (1.11.0+cu113)?
According to this page, reduce_scatter is supported by the NCCL backend on GPU.
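To check whether the failure is specific to DeepSpeed/accelerate, a minimal standalone sketch along these lines (file name and tensor sizes are my own, for illustration only) exercises both the plain reduce_scatter and the flat-tensor _reduce_scatter_base on the NCCL backend directly:

```python
# check_reduce_scatter.py (name assumed) -- run with:
#   torchrun --nproc_per_node=2 check_reduce_scatter.py
import torch
import torch.distributed as dist


def main():
    # torchrun sets MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE for us
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)

    # Plain reduce_scatter: a list with one input chunk per rank
    output = torch.zeros(4, device="cuda")
    inputs = [torch.ones(4, device="cuda") for _ in range(world_size)]
    dist.reduce_scatter(output, inputs)
    print(f"rank {rank}: reduce_scatter OK, output={output.tolist()}")

    # Flat-tensor variant that DeepSpeed calls in the traceback below
    flat_input = torch.ones(4 * world_size, device="cuda")
    flat_output = torch.zeros(4, device="cuda")
    dist._reduce_scatter_base(flat_output, flat_input)
    print(f"rank {rank}: _reduce_scatter_base OK, output={flat_output.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this also fails on _reduce_scatter_base, the problem would seem to be in the PyTorch build itself rather than in the training script.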
To Reproduce
Trained on an instance with 16 A100s; the error occurs with both 1 GPU and 16 GPUs.
Code: https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py
Command:
CUDA_VISIBLE_DEVICES=0 accelerate launch --config_file gpu1-Stage3-config.yaml deepspeed_with_config_support.py --model_name_or_path salesforce/codegen-2B-multi --dataset_name wikitext --dataset_config_name wikitext-103-v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --output_dir output/FT-mktcloud-test --max_train_steps 20 --num_warmup_steps 5 --with_tracking --learning_rate 1e-5
Traceback (most recent call last):
File "deepspeed_with_config_support.py", line 732, in <module>
main()
File "deepspeed_with_config_support.py", line 641, in main
accelerator.backward(loss)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/accelerate/accelerator.py", line 1435, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 168, in backward
self.engine.backward(loss)
......
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 268, in reduce_scatter_fn
return reduce_scatter_base(output_tensor,
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
return func(*args, **kwargs)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 302, in reduce_scatter_base
return cdb.reduce_scatter_base(output_tensor=output_tensor,
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 102, in reduce_scatter_base
return torch.distributed._reduce_scatter_base(output_tensor,
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2484, in _reduce_scatter_base
work = group._reduce_scatter_base(output, input, opts)
RuntimeError: ProcessGroup nccldoes not support _reduce_scatter_base
│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ deepspeed/comm/comm.py:302 in reduce_scatter_base │
│ │
│ 299 │ │ │ │ │ │ log_name='reduce_scatter_base', │
│ 300 │ │ │ │ │ │ debug=get_caller_func()): │
│ 301 │ global cdb │
│ ❱ 302 │ return cdb.reduce_scatter_base(output_tensor=output_tensor, │
│ 303 │ │ │ │ │ │ │ │ input_tensor=tensor, │
│ 304 │ │ │ │ │ │ │ │ op=op, │
│ 305 │ │ │ │ │ │ │ │ group=group, │
│ │
│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ deepspeed/comm/torch.py:102 in reduce_scatter_base │
│ │
│ 99 │ │ │ │ │ │ │ group=None, │
│ 100 │ │ │ │ │ │ │ async_op=False): │
│ 101 │ │ if self.has_reduce_scatter_base: │
│ ❱ 102 │ │ │ return torch.distributed._reduce_scatter_base(output_tenso │
│ 103 │ │ │ │ │ │ │ │ │ │ │ │ │ │ input_tensor │
│ 104 │ │ │ │ │ │ │ │ │ │ │ │ │ │ op=self._red │
│ 105 │ │ │ │ │ │ │ │ │ │ │ │ │ │ group=group, │
│ │
│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ torch/distributed/distributed_c10d.py:2484 in _reduce_scatter_base │
│ │
│ 2481 │ │ default_pg = _get_default_group() │
│ 2482 │ │ work = default_pg._reduce_scatter_base(output, input, opts) │
│ 2483 │ else: │
│ ❱ 2484 │ │ work = group._reduce_scatter_base(output, input, opts) │
│ 2485 │ │
│ 2486 │ if async_op: │
│ 2487 │ │ return work │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: ProcessGroup nccldoes not support _reduce_scatter_base
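For reference, the call chain in the traceback goes from DeepSpeed's reduce_scatter_fn through its torch backend wrapper into torch.distributed._reduce_scatter_base, which raises this RuntimeError when the underlying ProcessGroup implementation does not provide that collective. Purely to illustrate how the flat-tensor call relates to the plain reduce_scatter that the docs list as supported (this is not a DeepSpeed patch), the same result could be computed by chunking the flat input:

```python
# Illustrative sketch only: emulate the flat-tensor _reduce_scatter_base with the
# regular reduce_scatter collective by splitting the input into per-rank chunks.
# Assumes input_tensor.numel() == output.numel() * world_size, as the base op does.
import torch
import torch.distributed as dist


def reduce_scatter_base_fallback(output, input_tensor,
                                 op=dist.ReduceOp.SUM, group=None):
    world_size = dist.get_world_size(group=group)
    input_list = list(torch.chunk(input_tensor, world_size))
    return dist.reduce_scatter(output, input_list, op=op, group=group)
```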
Expected behavior
No error.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info:
- OS: Ubuntu 20.04.4 LTS
- Hardware: one machine with 16 A100s
- Python version: 3.8.12
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Launching with accelerate (not the deepspeed launcher or MPI).
Code: https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 accelerate launch --config_file gpu16-Stage-3-config.yaml deepspeed_with_config_support.py --model_name_or_path salesforce/codegen-2B-multi --dataset_name wikitext --dataset_config_name wikitext-103-v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --output_dir output/FT-mktcloud-test --max_train_steps 20 --num_warmup_steps 5 --with_tracking --learning_rate 1e-5
accelerate config:
- `Accelerate` version: 0.16.0
- Platform: Linux-5.10.133+-x86_64-with-glibc2.17
- Python version: 3.8.12
- Numpy version: 1.22.2
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- dynamo_backend: NO
- num_processes: 16
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero_stage': 2}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no