[BUG] NCCL timeout when saving a large model
Describe the bug
When the model is large (gpt-oss-120B), the NCCL watchdog times out while the model is being saved. The timeout fires after about 10 minutes, so the timeout I set at initialization apparently did not take effect. How can I set a longer NCCL timeout?
To Reproduce Steps to reproduce the behavior:
- Train gpt-oss-120B with DeepSpeed (openrlhf/cli/train_sft.py, launched with torchrun).
- Save the model with model_to_save.save_pretrained(...) during training.
- The pending ALLREDUCE times out after about 10 minutes while the save is running (see the logs below).
Expected behavior
At initialization I call deepspeed.init_distributed(timeout=timedelta(minutes=60)), and the model is saved with model_to_save.save_pretrained(output_dir, state_dict=output_state_dict, **kwargs). With a 60-minute timeout passed at init, saving the model should not hit the default 10-minute NCCL watchdog timeout.
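For reference, a minimal sketch of how the timeout is passed at initialization (the 60-minute value is just what I use; the commented-out torch.distributed call is the plain PyTorch equivalent):

from datetime import timedelta

import deepspeed

# Create the default NCCL process group with a 60-minute collective timeout
# (the default is 10 minutes, matching Timeout(ms)=600000 in the watchdog
# messages below).
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(minutes=60))

# Equivalent when initializing torch.distributed directly:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=timedelta(minutes=60))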
ds_report output
Screenshots
[rank5]:[E901 05:20:03.110823792 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 5] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E901 05:20:03.110837597 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 1] Observed flight recorder dump signal from another rank via TCPStore.
[rank3]:[E901 05:20:03.110849736 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 3] Observed flight recorder dump signal from another rank via TCPStore.
[rank7]:[E901 05:20:03.110858837 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 7] Observed flight recorder dump signal from another rank via TCPStore.
[rank0]:[E901 05:20:03.110881429 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank5]:[E901 05:20:03.110956967 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 5] Received a dump signal due to a collective timeout from rank 34 and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E901 05:20:03.110974473 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from rank 34 and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank3]:[E901 05:20:03.110995590 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from rank 34 and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank7]:[E901 05:20:03.111006917 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 7] Received a dump signal due to a collective timeout from rank 34 and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E901 05:20:03.111037666 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from rank 34 and we will try our best to dump the debug info. Last enqueued NCCL work: 460630, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E901 05:20:03.111301806 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank5]:[E901 05:20:03.111307784 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank7]:[E901 05:20:03.111311189 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 7] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank3]:[E901 05:20:03.111314316 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E901 05:20:03.111318397 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank6]:[E901 05:20:03.117746264 ProcessGroupNCCL.cpp:632] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank6]:[E901 05:20:03.118604598 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 6] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank6]:[E901 05:20:03.118615569 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E901 05:20:03.118671497 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 6] First PG on this rank to signal dumping.
[rank2]:[E901 05:20:03.118748741 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 2] Observed flight recorder dump signal from another rank via TCPStore.
[rank2]:[E901 05:20:03.118859045 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank2]:[E901 05:20:03.118963441 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank7]:[E901 05:20:03.123992144 ProcessGroupNCCL.cpp:632] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
[rank7]:[E901 05:20:03.124073319 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 7] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank7]:[E901 05:20:03.124080696 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E901 05:20:03.143902622 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank3]:[E901 05:20:03.144014715 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank3]:[E901 05:20:03.144037616 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank5]:[E901 05:20:03.152847707 ProcessGroupNCCL.cpp:632] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
[rank5]:[E901 05:20:03.152955167 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank5]:[E901 05:20:03.152962363 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E901 05:20:03.155303200 ProcessGroupNCCL.cpp:632] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank4]:[E901 05:20:03.155396703 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 4] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank4]:[E901 05:20:03.155402938 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E901 05:20:03.155429356 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 4] First PG on this rank to signal dumping.
[rank2]:[E901 05:20:03.155809160 ProcessGroupNCCL.cpp:632] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank2]:[E901 05:20:03.155895222 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank2]:[E901 05:20:03.155901519 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E901 05:20:03.172428065 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
[rank1]:[E901 05:20:03.172521066 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 460631 PG status: last enqueued work: 460631, last completed work: 460630
[rank1]:[E901 05:20:03.172526932 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E901 05:20:03.225069971 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 6] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank6]:[E901 05:20:03.309266837 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank2]:[E901 05:20:03.411236007 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E901 05:20:03.411277265 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E901 05:20:03.411819715 ProcessGroupNCCL.cpp:1809] [PG ID 0 PG GUID 0(default_pg) Rank 0] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank2]:[E901 05:20:03.413067656 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank5]:[E901 05:20:03.445217639 ProcessGroupNCCL.cpp:684] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E901 05:20:03.445252214 ProcessGroupNCCL.cpp:698] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E901 05:20:03.446595912 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank7]:[E901 05:20:03.462210114 ProcessGroupNCCL.cpp:684] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E901 05:20:03.462234781 ProcessGroupNCCL.cpp:698] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E901 05:20:03.463564641 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank3]:[E901 05:20:03.614679108 ProcessGroupNCCL.cpp:684] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E901 05:20:03.614723891 ProcessGroupNCCL.cpp:698] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E901 05:20:03.616086576 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank1]:[E901 05:20:04.798433587 ProcessGroupNCCL.cpp:684] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E901 05:20:04.798478934 ProcessGroupNCCL.cpp:698] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E901 05:20:04.799847760 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank6]:[E901 05:20:04.905604457 ProcessGroupNCCL.cpp:684] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E901 05:20:04.905652844 ProcessGroupNCCL.cpp:698] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E901 05:20:04.907052983 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
[rank4]:[E901 05:20:04.013497094 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 4] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 460631, last completed NCCL work: 460630.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank4]:[E901 05:20:04.013667500 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 4] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank4]:[E901 05:20:04.607687070 ProcessGroupNCCL.cpp:684] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E901 05:20:04.607732735 ProcessGroupNCCL.cpp:698] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E901 05:20:04.609440126 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=460631, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
W0901 05:20:14.015000 5475 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5721 closing signal SIGTERM
W0901 05:20:14.018000 5475 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5723 closing signal SIGTERM
W0901 05:20:14.018000 5475 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5725 closing signal SIGTERM
W0901 05:20:14.018000 5475 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5726 closing signal SIGTERM
W0901 05:20:44.019000 5475 torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 5723 via 15, forcefully exiting via 9
E0901 05:20:48.238000 5475 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 1 (pid: 5722) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
openrlhf/cli/train_sft.py FAILED
Failures:
[1]:
  time       : 2025-09-01_05:20:14
  host       : dccd-pcde6-1004-c90a-af2-76f-d3ec.byted.org
  rank       : 3 (local_rank: 3)
  exitcode   : -6 (pid: 5724)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 5724
[2]:
  time       : 2025-09-01_05:20:14
  host       : dccd-pcde6-1004-c90a-af2-76f-d3ec.byted.org
  rank       : 6 (local_rank: 6)
  exitcode   : -6 (pid: 5727)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 5727
[3]:
  time       : 2025-09-01_05:20:14
  host       : dccd-pcde6-1004-c90a-af2-76f-d3ec.byted.org
  rank       : 7 (local_rank: 7)
  exitcode   : -6 (pid: 5728)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 5728
Root Cause (first observed failure):
[0]:
  time       : 2025-09-01_05:20:14
  host       : dccd-pcde6-1004-c90a-af2-76f-d3ec.byted.org
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 5722)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 5722
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each]
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version
- Any other relevant info about your setup
Launcher context
The job is launched with torchrun (running openrlhf/cli/train_sft.py), as shown in the traceback above.
Docker context
Are you using a specific docker image that you can share?
Additional context
NCCL initialization logs from a later run. Note the different TIMEOUT values reported for PG 0 and PG 1:
[I902 06:39:03.558436021 ProcessGroupNCCL.cpp:1078] [PG ID 0 PG GUID 0(default_pg) Rank 7] Using non-blocking mode: 0
[I902 06:39:03.558502137 ProcessGroupNCCL.cpp:2828] [PG ID 0 PG GUID 0(default_pg) Rank 7] ProcessGroupNCCL broadcast unique ID through store took 0.061753 ms
[I902 06:39:03.558516015 NCCLUtils.cpp:75] Rank 7: creating NCCL communicator with mode: blocking
[W902 06:39:03.559670161 Utils.hpp:166] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[I902 06:39:03.559707495 ProcessGroupNCCL.cpp:950] [PG ID 0 PG GUID 0 Rank 1] TORCH_NCCL_BLOCKING_WAIT is enabled, NO watchdog thread is created.
[I902 06:39:03.559717251 ProcessGroupNCCL.cpp:978] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL initialization options: size: 32, global rank: 1, TIMEOUT(ms): 10800000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: -2, PG Name: 0
[I902 06:39:03.559721862 ProcessGroupNCCL.cpp:987] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.26.2, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_PROPAGATE_ERROR: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 1, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 2000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 1, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[I902 06:39:03.559800985 ProcessGroupNCCL.cpp:1042] [PG ID 0 PG GUID 0 Rank 1] Eagerly connecting nccl backend with device cuda:1
[I902 06:39:03.559829071 ProcessGroupNCCL.cpp:1078] [PG ID 0 PG GUID 0(default_pg) Rank 1] Using non-blocking mode: 0
[I902 06:39:03.559894574 ProcessGroupNCCL.cpp:2828] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.061443 ms
[I902 06:39:03.559906811 NCCLUtils.cpp:75] Rank 1: creating NCCL communicator with mode: blocking
[I902 06:39:07.607101798 ProcessGroupNCCL.cpp:2861] [PG ID 0 PG GUID 0(default_pg) Rank 4] NCCL_DEBUG: INFO
[I902 06:39:07.607242969 ProcessGroupNCCL.cpp:2861] [PG ID 0 PG GUID 0(default_pg) Rank 5] NCCL_DEBUG: INFO
[I902 06:39:07.607457816 ProcessGroupNCCL.cpp:2861] [PG ID 0 PG GUID 0(default_pg) Rank 1] NCCL_DEBUG: INFO
[rank1]:[I902 06:39:07.608540006 ProcessGroupNCCL.cpp:950] [PG ID 1 PG GUID 1 Rank 1] TORCH_NCCL_BLOCKING_WAIT is enabled, NO watchdog thread is created.
[rank4]:[I902 06:39:07.608539784 ProcessGroupNCCL.cpp:950] [PG ID 1 PG GUID 1 Rank 4] TORCH_NCCL_BLOCKING_WAIT is enabled, NO watchdog thread is created.
[rank5]:[I902 06:39:07.608539766 ProcessGroupNCCL.cpp:950] [PG ID 1 PG GUID 1 Rank 5] TORCH_NCCL_BLOCKING_WAIT is enabled, NO watchdog thread is created.
[rank1]:[I902 06:39:07.608558008 ProcessGroupNCCL.cpp:978] [PG ID 1 PG GUID 1 Rank 1] ProcessGroupNCCL initialization options: size: 32, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0xca66d30, SPLIT_COLOR: 2008404567, PG Name: 1
[rank4]:[I902 06:39:07.608567619 ProcessGroupNCCL.cpp:978] [PG ID 1 PG GUID 1 Rank 4] ProcessGroupNCCL initialization options: size: 32, global rank: 4, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0xcf35130, SPLIT_COLOR: 2008404567, PG Name: 1
[rank5]:[I902 06:39:07.608578729 ProcessGroupNCCL.cpp:978] [PG ID 1 PG GUID 1 Rank 5] ProcessGroupNCCL initialization options: size: 32, global rank: 5, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0xcb300b0, SPLIT_COLOR: 2008404567, PG Name: 1
Different PG IDs end up with different timeout settings (PG 0 reports TIMEOUT(ms): 10800000, while PG 1 reports TIMEOUT(ms): 600000), which seems strange.
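My guess (an assumption on my side, not verified in the DeepSpeed source) is that the timeout passed to deepspeed.init_distributed only applies to the default process group (PG 0), and that process groups created afterwards with torch.distributed.new_group fall back to the 10-minute NCCL default unless a timeout is passed to them explicitly. A minimal sketch of that difference, assuming the default group is already initialized with a long timeout:

from datetime import timedelta

import torch.distributed as dist

# Assumes deepspeed.init_distributed / dist.init_process_group was already
# called with a long timeout for the default group (PG 0).
ranks = list(range(dist.get_world_size()))

# No explicit timeout: the new group can end up with the NCCL backend default
# of 600000 ms, which is what PG 1 reports in the log above.
pg_default = dist.new_group(ranks=ranks)

# Explicit timeout: the new group gets the 60-minute value.
pg_long = dist.new_group(ranks=ranks, timeout=timedelta(minutes=60))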
I have a similar problem. Also asking: how can the timeout be set to a custom value?