Megatron-LM
[QUESTION] How to debug a hang in async_save mode?
Your question: In my case, Megatron checkpointing blocks in async_save mode. The checkpoint files exist in the right place but contain no real data (each is only 6305 bytes):
-rw-r--r-- 1 root root 6305 Nov 28 13:47 __0_0.distcp
-rw-r--r-- 1 root root 6305 Nov 28 13:47 __0_1.distcp
The py-spy stack dump of the training process is below:
Thread 36380 (idle): "MainThread"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
close (core/dist_checkpointing/strategies/async_utils.py:248)
is_current_async_call_done (core/dist_checkpointing/strategies/async_utils.py:236)
maybe_finalize_async_calls (core/dist_checkpointing/strategies/async_utils.py:537)
save (core/dist_checkpointing/strategies/base.py:228)
save (core/dist_checkpointing/strategies/fully_parallel.py:95)
save (core/dist_checkpointing/serialization.py:396)
Thread 38606 (idle): "Thread-2 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:947)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:37)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:61)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 38607 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
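A possible next step, as a rough sketch and not Megatron-LM code: the MainThread stack above is blocked in `multiprocessing`'s `Process.join` (via `popen_fork.py`), so the parent is waiting on a forked child process, and the dump above only covers the parent. The standalone example below reproduces that pattern with a bounded `join` so the child's PID can be printed and inspected with `py-spy dump --pid <pid>`. The names `slow_writer` and `join_with_timeout` are hypothetical stand-ins, and the `fork` start method is assumed only because the trace shows `popen_fork.py`.

```python
import multiprocessing as mp
import time


def slow_writer(seconds):
    # Hypothetical stand-in for an async checkpoint writer that takes too long.
    time.sleep(seconds)


def join_with_timeout(proc, timeout):
    """Wait up to `timeout` seconds for `proc`.

    Returns True if the process exited in time. Otherwise prints the child's
    PID (so its stack can be dumped separately), terminates it, and returns False.
    """
    proc.join(timeout)
    if proc.is_alive():
        print(
            f"worker pid={proc.pid} still running after {timeout}s; "
            f"inspect it with: py-spy dump --pid {proc.pid}"
        )
        proc.terminate()  # in a real debugging session, dump the stack before this
        proc.join()
        return False
    return True


if __name__ == "__main__":
    # Assumption: fork start method, matching popen_fork.py in the trace.
    ctx = mp.get_context("fork")
    worker = ctx.Process(target=slow_writer, args=(60,))
    worker.start()
    finished = join_with_timeout(worker, timeout=0.5)
    print("finished in time:", finished)
```

In the situation from the question, the same idea applies without changing any code: find the child PID of the training process (for example with `ps --ppid 36380`) and run `py-spy dump` against that PID to see where the writer itself is stuck.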