DeepSpeed
[BUG] NCCL out of memory on `save_checkpoint()`
Describe the bug
I'm training a model and trying to save it with save_checkpoint() after the first epoch. Training itself (ZeRO stage 0, bf16) runs smoothly, but saving triggers an NCCL out-of-memory error. Is this a known issue, and is there a way around it?
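For context, the relevant parts of my DeepSpeed config look roughly like this (a minimal sketch consistent with the stage 0 / bf16 setup above; the batch size value here is illustrative, not my exact setting):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 0 },
  "train_micro_batch_size_per_gpu": 8
}
```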
Traceback (most recent call last):
File "/mnt/tier2/users/u100424/repos/signal-peptide-generation/src/run_pretraining.py", line 3, in <module>
train(get_args())
File "/mnt/tier2/users/u100424/repos/signal-peptide-generation/src/pretraining/train_loop_ds.py", line 134, in train
model_engine_transformer.save_checkpoint(os.path.join(args.name, 'transformer'), ckpt_id)
File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3088, in save_checkpoint
dist.barrier()
File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
return func(*args, **kwargs)
File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 469, in barrier
return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 160, in barrier
return torch.distributed.barrier(group=group,
File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3147, in barrier
work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 2 failed (Setup)
mel2045:30570:32483 [2] proxy.cc:1119 NCCL WARN [Proxy Service 2] Failed to execute operation Setup from rank 2, retcode 1
mel2045:30570:32483 [2] include/alloc.h:99 NCCL WARN Cuda failure 'out of memory'
mel2045:30570:32483 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 6291456 bytes
mel2045:30570:32483 [2] NCCL INFO transport/p2p.cc:430 -> 1
mel2045:30570:32483 [2] NCCL INFO proxy.cc:989 -> 1
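One observation about the numbers in the log: the allocation NCCL failed to make is tiny relative to A100 memory, which suggests the GPUs were already at or near capacity when the barrier set up its proxy buffers, rather than the barrier itself needing a lot of memory. A quick sanity check (plain Python, value taken from the log line above):

```python
# Size NCCL failed to CUDA-calloc, from the log line above
failed_alloc_bytes = 6291456

# Convert to MiB to see how small the failed request actually is
failed_alloc_mib = failed_alloc_bytes / (1024 * 1024)
print(f"NCCL failed to allocate {failed_alloc_mib:.0f} MiB")  # 6 MiB
```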
Expected behavior
I did not expect saving to need more memory than training.
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
deepspeed install path ........... ['/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- Red Hat Enterprise Linux 8.6 (Ootpa)
- 4x A100
- Python version 3.9.15
Launcher context: deepspeed launcher