
qwen2.5 3b sft multi-node training: one node hits GPU OOM at a fixed step

Open phantooom opened this issue 11 months ago • 3 comments

We ran a job on 8 nodes with 8×A100 80 GB GPUs each; at around step 330 one GPU hits OOM and the whole job fails. The launch command is as follows:

torchrun --nnodes=8 --node_rank=0 --master_addr=qwen2-5vl-3b-pretrain-transcribe-20m-bs-zero3-52zfi-master-0.qwen2-5vl-3b-pretrain-transcribe-20m-bs-zero3-52zfi --nproc_per_node=8 --master_port=34229 swift/cli/sft.py --model /data/CV/Qwen2.5-VL-3B-Instruct --model_type qwen2_5_vl --train_type full --freeze_llm False --freeze_vit False --freeze_aligner False --output_dir /data/CV/lmm_model_output --dataset /data/CV/datasets/mllm-deploy --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-5 --max_length 4096 --num_train_epochs 1 --save_steps 5000 --eval_steps 2000000 --save_total_limit 1 --logging_steps 5 --warmup_ratio 0.05 --torch_dtype bfloat16 --dataloader_num_workers 8 --dataset_num_proc 8 --max_pixels 351232 --deepspeed zero2

[rank62]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.14 GiB. GPU 6 has a total capacity of 79.33 GiB of which 8.35 GiB is free. Including non-PyTorch memory, this process has 70.97 GiB memory in use. Of the allocated memory 68.65 GiB is allocated by PyTorch, and 853.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

ms-swift versions 3.1.1 and 3.1.0

After one node OOMs, an NCCL timeout is reported about 10 minutes later. The OOM happens at exactly step 334 every time.

Train:   0%|          | 334/77282 [47:04<111:20:48,  5.21s/it][rank2]:[E228 10:42:56.780357497 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank2]:[E228 10:42:56.788738267 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 2]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank2]:[E228 10:42:56.791395244 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E228 10:42:56.793383362 ProcessGroupNCCL.cpp:629] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank6]:[E228 10:42:56.793466151 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 6]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank6]:[E228 10:42:56.793476083 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E228 10:42:56.797401918 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
[rank4]:[E228 10:42:56.797502233 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 4]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank4]:[E228 10:42:56.797514956 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank5]:[E228 10:42:56.810307077 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank5]:[E228 10:42:56.810387936 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 5]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank5]:[E228 10:42:56.810398173 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank7]:[E228 10:42:56.838085653 ProcessGroupNCCL.cpp:629] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[rank7]:[E228 10:42:56.838168965 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 7]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank7]:[E228 10:42:56.838181214 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E228 10:42:56.858107403 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600092 milliseconds before timing out.
[rank1]:[E228 10:42:56.858186078 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 1]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank1]:[E228 10:42:56.858196012 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E228 10:42:56.861499619 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank3]:[E228 10:42:56.861567571 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 3]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank3]:[E228 10:42:56.861578342 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E228 10:42:56.861642472 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
[rank0]:[E228 10:42:56.861715752 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 0]  failure detected by watchdog at work sequence id: 28895 PG status: last enqueued work: 28896, last completed work: 28894
[rank0]:[E228 10:42:56.861725134 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
iv-ydin754tmoxjd1twe53l:2673725:2683527 [0] NCCL INFO [Service thread] Connection closed by localRank 1
iv-ydin754tmoxjd1twe53l:2673726:2683519 [1] NCCL INFO [Service thread] Connection closed by localRank 1
iv-ydin754tmoxjd1twe53l:2673727:2683522 [2] NCCL INFO [Service thread] Connection closed by localRank 1
iv-ydin754tmoxjd1twe53l:2673729:2683521 [4] NCCL INFO [Service thread] Connection closed by localRank 1
iv-ydin754tmoxjd1twe53l:2673731:2683517 [6] NCCL INFO [Service thread] Connection closed by localRank 1
iv-ydin754tmoxjd1twe53l:2673726:2683346 [1] NCCL INFO comm 0x557a539f2fe0 rank 1 nranks 64 cudaDev 1 busId 52000 - Abort COMPLETE
[rank1]:[E228 10:42:58.371032448 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E228 10:42:58.371048386 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E228 10:42:58.376320207 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600092 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9634f6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f95e3229c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f95e322b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f95e322c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f96354575c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f9635cc0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f9635d52a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600092 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9634f6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f95e3229c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f95e322b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f95e322c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f96354575c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f9635cc0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f9635d52a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9634f6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f95e2e876fc in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f96354575c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f9635cc0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a40 (0x7f9635d52a40 in /lib/x86_64-linux-gnu/libc.so.6)

iv-ydin754tmoxjd1twe53l:2673725:2683527 [0] NCCL INFO [Service thread] Connection closed by localRank 7
iv-ydin754tmoxjd1twe53l:2673727:2683522 [2] NCCL INFO [Service thread] Connection closed by localRank 7
iv-ydin754tmoxjd1twe53l:2673729:2683521 [4] NCCL INFO [Service thread] Connection closed by localRank 7
iv-ydin754tmoxjd1twe53l:2673731:2683517 [6] NCCL INFO [Service thread] Connection closed by localRank 7
iv-ydin754tmoxjd1twe53l:2673732:2683518 [7] NCCL INFO [Service thread] Connection closed by localRank 7
iv-ydin754tmoxjd1twe53l:2673725:2683527 [0] NCCL INFO [Service thread] Connection closed by localRank 5
iv-ydin754tmoxjd1twe53l:2673727:2683522 [2] NCCL INFO [Service thread] Connection closed by localRank 5
iv-ydin754tmoxjd1twe53l:2673729:2683521 [4] NCCL INFO [Service thread] Connection closed by localRank 5
iv-ydin754tmoxjd1twe53l:2673731:2683517 [6] NCCL INFO [Service thread] Connection closed by localRank 5
iv-ydin754tmoxjd1twe53l:2673730:2683520 [5] NCCL INFO [Service thread] Connection closed by localRank 5
iv-ydin754tmoxjd1twe53l:2673732:2683330 [7] NCCL INFO comm 0x55c36c98e230 rank 7 nranks 64 cudaDev 7 busId d8000 - Abort COMPLETE
[rank7]:[E228 10:42:58.640867461 ProcessGroupNCCL.cpp:681] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E228 10:42:58.640879264 ProcessGroupNCCL.cpp:695] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
iv-ydin754tmoxjd1twe53l:2673730:2683341 [5] NCCL INFO comm 0x55a2bd8581d0 rank 5 nranks 64 cudaDev 5 busId b7000 - Abort COMPLETE
[rank5]:[E228 10:42:58.641724879 ProcessGroupNCCL.cpp:681] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E228 10:42:58.641735622 ProcessGroupNCCL.cpp:695] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E228 10:42:58.642349972 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f63b816c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f6366429c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f636642b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f636642c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f63b86495c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f63b8eb2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f63b8f44a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f63b816c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f6366429c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f636642b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f636642c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f63b86495c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f63b8eb2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f63b8f44a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f63b816c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f63660876fc in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f63b86495c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f63b8eb2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a40 (0x7f63b8f44a40 in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E228 10:42:58.643091102 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe7ffd6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7fe7ae029c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7fe7ae02b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe7ae02c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe8002625c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fe800acbac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7fe800b5da40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28895, OpType=ALLREDUCE, NumelIn=196976700, NumelOut=196976700, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe7ffd6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7fe7ae029c74 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7fe7ae02b7d0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe7ae02c6ed in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe8002625c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fe800acbac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7fe800b5da40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe7ffd6c1b6 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7fe7adc876fc in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7fe8002625c0 in /data/CV/rui.zou/conda/envs/ms-swift/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7fe800acbac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a40 (0x7fe800b5da40 in /lib/x86_64-linux-gnu/libc.so.6)

Are there any parameters we could tune to work around this?

We tried zero3, but it was somewhat slow.
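
For context on the knobs involved: the OOM message above already suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation, and for a Qwen2.5-VL run --max_pixels is the main per-sample memory lever, since it caps the number of visual tokens per image. Below is a rough sketch of that relationship, assuming the model's 14-pixel patches with 2x2 spatial merging (about one visual token per 28x28 pixel block); the smaller candidate values are illustrative, not tuned.

# Back-of-the-envelope only: assumes Qwen2.5-VL's vision tower uses 14x14 patches
# with 2x2 spatial merging, i.e. roughly one visual token per 28x28 pixel block.
PIXELS_PER_TOKEN = (14 * 2) ** 2  # 784

def approx_visual_tokens(max_pixels: int) -> int:
    """Rough upper bound on visual tokens for an image capped at max_pixels."""
    return max_pixels // PIXELS_PER_TOKEN

# Current setting and two illustrative smaller candidates.
for mp in (351232, 262144, 100352):
    print(f"--max_pixels {mp} -> ~{approx_visual_tokens(mp)} visual tokens per image")

Lowering --max_pixels shrinks both activation memory and the spread between samples, which matters when a single outlier image is what pushes one rank over the edge.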

phantooom · Feb 28 '25

Has this been resolved? I'm hitting the same problem...

lukasindeed · Apr 14 '25

Try using --deepspeed zero3.

IamMegatron2025 · Nov 05 '25

We eventually found that one data sample was problematic; removing it fixed the issue.
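
The thread doesn't record what exactly was wrong with that sample. As a starting point, here is a minimal sketch of one way to scan a dataset for obvious offenders (unreadable images, or images far larger than --max_pixels), assuming a JSONL file with an "images" field of local paths; the file name, field name, and threshold are placeholders to adapt to the actual schema.

# Minimal sketch for locating suspicious samples. Assumes a JSONL dataset where
# each line has an "images" field listing local file paths -- adjust the path and
# field names to your actual schema. Flags unreadable images and unusually large ones.
import json
from PIL import Image

DATASET = "/data/CV/datasets/mllm-deploy/train.jsonl"  # hypothetical file name
MAX_PIXELS = 351232  # the --max_pixels used in the failing run

with open(DATASET, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        sample = json.loads(line)
        for path in sample.get("images", []):
            try:
                with Image.open(path) as img:
                    w, h = img.size  # reads the header only, no full decode
            except Exception as e:
                print(f"line {lineno}: unreadable image {path}: {e}")
                continue
            if w * h > 10 * MAX_PIXELS:  # arbitrary "suspiciously large" threshold
                print(f"line {lineno}: {path} is {w}x{h} ({w * h} pixels)")

Overlong text fields or malformed conversations can produce the same symptom, so the same loop can be extended to check per-sample token counts as well.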

phantooom · Nov 13 '25