
[BUG] (launched with accelerate) Stage 3 backward RuntimeError: ProcessGroup nccl does not support _reduce_scatter_base


Describe the bug: I don't know why training with ZeRO Stage 3 triggers RuntimeError: ProcessGroup nccl does not support _reduce_scatter_base. Training with Stage 2 works fine. Could this be related to my PyTorch version (1.11.0+cu113)?

According to this page, reduce_scatter is supported by NCCL on GPU.
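
To double-check that outside of DeepSpeed, here is a minimal standalone sketch (my addition, assuming a torchrun launch on the same machine) that exercises exactly the call that fails in the traceback below:

```python
# check_reduce_scatter_base.py -- minimal sketch to isolate the failing call.
# Assumed launch: torchrun --nproc_per_node=2 check_reduce_scatter_base.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)  # single-node assumption: rank == local rank

# Each rank contributes world_size chunks of 4 elements; reduce-scatter sums
# the chunks across ranks and leaves one reduced chunk on each rank.
input_tensor = torch.full((4 * world_size,), float(rank), device="cuda")
output_tensor = torch.empty(4, device="cuda")

# Same private API that DeepSpeed's TorchBackend calls in deepspeed/comm/torch.py.
# (Newer PyTorch releases expose this as dist.reduce_scatter_tensor instead.)
dist._reduce_scatter_base(output_tensor, input_tensor)
print(f"rank {rank}: {output_tensor}")

dist.destroy_process_group()
```

If this script raises the same RuntimeError, the problem would be in the PyTorch/NCCL build rather than in the training script.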

To Reproduce: Trained on an instance with 16 A100 GPUs; the run fails with either 1 GPU or 16 GPUs. Code: https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py

Command: CUDA_VISIBLE_DEVICES=0 accelerate launch --config_file gpu1-Stage3-config.yaml deepspeed_with_config_support.py --model_name_or_path salesforce/codegen-2B-multi --dataset_name wikitext --dataset_config_name wikitext-103-v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --output_dir output/FT-mktcloud-test --max_train_steps 20 --num_warmup_steps 5 --with_tracking --learning_rate 1e-5

Traceback (most recent call last):
  File "deepspeed_with_config_support.py", line 732, in <module>
    main()
  File "deepspeed_with_config_support.py", line 641, in main
    accelerator.backward(loss)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/accelerate/accelerator.py", line 1435, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 168, in backward
    self.engine.backward(loss)
......
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 268, in reduce_scatter_fn
    return reduce_scatter_base(output_tensor,
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 302, in reduce_scatter_base
    return cdb.reduce_scatter_base(output_tensor=output_tensor,
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 102, in reduce_scatter_base
    return torch.distributed._reduce_scatter_base(output_tensor,
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2484, in _reduce_scatter_base
    work = group._reduce_scatter_base(output, input, opts)
RuntimeError: ProcessGroup nccldoes not support _reduce_scatter_base


│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ deepspeed/comm/comm.py:302 in reduce_scatter_base                            │
│                                                                              │
│   299 │   │   │   │   │   │   log_name='reduce_scatter_base',                │
│   300 │   │   │   │   │   │   debug=get_caller_func()):                      │
│   301 │   global cdb                                                         │
│ ❱ 302 │   return cdb.reduce_scatter_base(output_tensor=output_tensor,        │
│   303 │   │   │   │   │   │   │   │      input_tensor=tensor,                │
│   304 │   │   │   │   │   │   │   │      op=op,                              │
│   305 │   │   │   │   │   │   │   │      group=group,                        │
│                                                                              │
│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ deepspeed/comm/torch.py:102 in reduce_scatter_base                           │
│                                                                              │
│    99 │   │   │   │   │   │   │   group=None,                                │
│   100 │   │   │   │   │   │   │   async_op=False):                           │
│   101 │   │   if self.has_reduce_scatter_base:                               │
│ ❱ 102 │   │   │   return torch.distributed._reduce_scatter_base(output_tenso │
│   103 │   │   │   │   │   │   │   │   │   │   │   │   │   │     input_tensor │
│   104 │   │   │   │   │   │   │   │   │   │   │   │   │   │     op=self._red │
│   105 │   │   │   │   │   │   │   │   │   │   │   │   │   │     group=group, │
│                                                                              │
│ /export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/ │
│ torch/distributed/distributed_c10d.py:2484 in _reduce_scatter_base           │
│                                                                              │
│   2481 │   │   default_pg = _get_default_group()                             │
│   2482 │   │   work = default_pg._reduce_scatter_base(output, input, opts)   │
│   2483 │   else:                                                             │
│ ❱ 2484 │   │   work = group._reduce_scatter_base(output, input, opts)        │
│   2485 │                                                                     │
│   2486 │   if async_op:                                                      │
│   2487 │   │   return work                                                   │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: ProcessGroup nccldoes not support _reduce_scatter_base
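
From the trace, the path is reduce_scatter_fn → reduce_scatter_base → torch.distributed._reduce_scatter_base, and deepspeed/comm/torch.py only takes that branch when self.has_reduce_scatter_base is true. A quick probe like the following (my own sketch, not DeepSpeed code) shows whether the installed torch build exposes the op and which backend the process group is using:

```python
# Environment probe (sketch): run inside the same accelerate launch so that
# torch.distributed is already initialized.
import torch
import torch.distributed as dist

print("torch:", torch.__version__)
print("has _reduce_scatter_base:", hasattr(dist, "_reduce_scatter_base"))
if dist.is_initialized():
    print("backend:", dist.get_backend())
    print("nccl version:", torch.cuda.nccl.version())
```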

Expected behavior: No error.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 20.04.4 LTS
  • One machine with 16 A100 GPUs
  • Python version: 3.8.12

Launcher context: launched with accelerate, not the deepspeed launcher or MPI.

Code: https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 accelerate launch --config_file gpu16-Stage-3-config.yaml deepspeed_with_config_support.py --model_name_or_path salesforce/codegen-2B-multi --dataset_name wikitext --dataset_config_name wikitext-103-v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --output_dir output/FT-mktcloud-test --max_train_steps 20 --num_warmup_steps 5 --with_tracking --learning_rate 1e-5

accelerate config:

- `Accelerate` version: 0.16.0
- Platform: Linux-5.10.133+-x86_64-with-glibc2.17
- Python version: 3.8.12
- Numpy version: 1.22.2
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: bf16
	- use_cpu: False
	- dynamo_backend: NO
	- num_processes: 16
	- machine_rank: 0
	- num_machines: 1
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero_stage': 2}
	- fsdp_config: {}
	- megatron_lm_config: {}
	- downcast_bf16: no
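
The gpu16-Stage-3-config.yaml referenced in the command is not reproduced here; an accelerate ZeRO Stage-3 config of roughly this shape is what I mean (illustrative sketch only, not the exact file):

```yaml
# Illustrative sketch of an accelerate ZeRO Stage-3 config; NOT the actual
# gpu16-Stage-3-config.yaml used in the command above.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_machines: 1
num_processes: 16
machine_rank: 0
main_training_function: main
deepspeed_config:
  zero_stage: 3
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
```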


memray · Feb 21 '23 20:02