[QUESTION] How to debug a hang in async_save mode?

Open stormchasingg opened this issue 5 months ago • 0 comments

Your question

In my case, Megatron checkpointing hangs in async_save mode. The checkpoint files are created in the right place but contain no data:

-rw-r--r-- 1 root root  6305 Nov 28 13:47 __0_0.distcp
-rw-r--r-- 1 root root  6305 Nov 28 13:47 __0_1.distcp

The py-spy stack dump is below:

Thread 36380 (idle): "MainThread"
    poll (multiprocessing/popen_fork.py:27)
    wait (multiprocessing/popen_fork.py:43)
    join (multiprocessing/process.py:149)
    close (core/dist_checkpointing/strategies/async_utils.py:248)
    is_current_async_call_done (core/dist_checkpointing/strategies/async_utils.py:236)
    maybe_finalize_async_calls (core/dist_checkpointing/strategies/async_utils.py:537)
    save (core/dist_checkpointing/strategies/base.py:228)
    save (core/dist_checkpointing/strategies/fully_parallel.py:95)
    save (core/dist_checkpointing/serialization.py:396)
Thread 38606 (idle): "Thread-2 (_pin_memory_loop)"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:947)
    _poll (multiprocessing/connection.py:440)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:37)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:61)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 38607 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)

stormchasingg · Nov 28 '25 05:11