[QUESTION] Why enable `non_blocking=True` when doing synchronous D2H?

Open raywan-110 opened this issue 1 year ago • 2 comments

The comment on line 76 of `filesystem_async.py` says that Megatron performs synchronous Device-to-Host (D2H) transfers for checkpointing. However, line 94 passes `non_blocking=True` to these transfers (code link). I could not find any explicit CUDA stream or event synchronization in the subsequent steps of the checkpointing process. Could this omission introduce a correctness risk, such as saving incomplete CPU tensors to disk?
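To make the concern concrete, here is a minimal sketch of a non-blocking D2H copy followed by an explicit synchronization point, assuming plain PyTorch. `copy_to_host` is a hypothetical helper written for illustration; it is not taken from Megatron's `filesystem_async.py`, and that file may synchronize somewhere else.

```python
import torch


def copy_to_host(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: D2H copy that is safe to read once it returns."""
    if not t.is_cuda:
        # Nothing asynchronous happens for a CPU tensor; return a plain copy.
        return t.detach().clone()

    # A pinned (page-locked) destination lets the copy run asynchronously
    # with respect to the host thread.
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    host.copy_(t.detach(), non_blocking=True)

    # Record an event on the current stream after the copy was enqueued,
    # then block the host until the stream reaches that event.
    done = torch.cuda.Event()
    done.record()
    done.synchronize()
    return host
```

Without the `done.synchronize()` call (or an equivalent such as `torch.cuda.current_stream().synchronize()`), the host thread could in principle read or serialize `host` before the copy has finished, which is the situation the question asks about.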

May 22 '24 03:05 raywan-110

FYI: screenshot

May 22 '24 03:05 raywan-110

Marking as stale. No activity in 60 days.

Jul 21 '24 18:07 github-actions[bot]