
[BUG] NCCL out of memory on `save_checkpoint()`

Open · fteufel opened this issue 2 years ago · 0 comments

Describe the bug I'm training a model and trying to save it with `save_checkpoint()` after the first epoch. Training (ZeRO stage 0, bf16) runs smoothly, but I get an NCCL error as soon as I try to save. Is this a known issue, and is there a way around it?
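For context, a rough sketch of what the setup looks like (the config values and the toy model are placeholders, not my actual script):

```python
import torch
import deepspeed

# Placeholder config matching the description above: ZeRO stage 0, bf16.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# Toy model standing in for the real transformer.
model = torch.nn.Linear(1024, 1024)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(100):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()

# The crash happens on this call: save_checkpoint() issues a dist.barrier()
# internally, and that barrier raises the NCCL error shown below.
engine.save_checkpoint("checkpoints/transformer", tag="epoch0")
```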

Traceback (most recent call last):
  File "/mnt/tier2/users/u100424/repos/signal-peptide-generation/src/run_pretraining.py", line 3, in <module>
    train(get_args())
  File "/mnt/tier2/users/u100424/repos/signal-peptide-generation/src/pretraining/train_loop_ds.py", line 134, in train
    model_engine_transformer.save_checkpoint(os.path.join(args.name, 'transformer'), ckpt_id)
  File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3088, in save_checkpoint
    dist.barrier()
  File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3147, in barrier
    work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 2 failed (Setup)
mel2045:30570:32483 [2] proxy.cc:1119 NCCL WARN [Proxy Service 2] Failed to execute operation Setup from rank 2, retcode 1

mel2045:30570:32483 [2] include/alloc.h:99 NCCL WARN Cuda failure 'out of memory'

mel2045:30570:32483 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 6291456 bytes
mel2045:30570:32483 [2] NCCL INFO transport/p2p.cc:430 -> 1
mel2045:30570:32483 [2] NCCL INFO proxy.cc:989 -> 1

Expected behavior I did not expect saving a checkpoint to need more GPU memory than training itself.
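A guess at what is going on, plus an untested workaround sketch: the log shows NCCL failing to allocate about 6 MB for its proxy/P2P transport buffers, which suggests the barrier inside `save_checkpoint()` is setting up a communication path that was never exercised during training, at a point where PyTorch's caching allocator has already claimed almost all GPU memory. Releasing cached blocks just before saving might leave enough headroom (the helper name below is mine, not a DeepSpeed API):

```python
import torch
import deepspeed.comm as dist

def save_with_headroom(engine, save_dir, tag):
    """Hypothetical workaround, not a confirmed fix: release PyTorch's
    cached GPU blocks so NCCL can allocate its transport buffers when
    save_checkpoint() hits its internal dist.barrier()."""
    torch.cuda.empty_cache()   # hand cached-but-unused memory back to CUDA
    dist.barrier()             # surface any NCCL setup failure before writing files
    engine.save_checkpoint(save_dir, tag)
```

If that is not enough, the other thing I would try is issuing a barrier once right after `deepspeed.initialize()`, so the NCCL buffers get allocated before the allocator fills the GPU.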

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
deepspeed install path ........... ['/home/users/u100424/miniconda3/envs/spgen/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

  • Red Hat Enterprise Linux 8.6 (Ootpa)
  • 4x A100
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version 3.9.15

Launcher context deepspeed launcher

fteufel · Apr 13 '23 07:04