Feedback about Asynchronous Saving with Distributed Checkpoint (DCP)
Hey there! Little nitpick about the last block of this docs page: https://docs.pytorch.org/tutorials/recipes/distributed_async_checkpoint_recipe.html
The checkpoint_future variable is never written to in the last block.
Perhaps the intent was to have this instead?
checkpoint_future = dcp.async_save(state_dict, storage_writer=writer, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")
cc @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @haochengsong @Saiteja64
Also, based on this blog post (https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250), I thought that one requirement for lowering the overhead of async DCP was that the checkpointing work has to overlap the forward and backward passes and be completed (with the CUDA stream synchronized) before the next optimizer.step(). Is that right?
Here you only wait for the previous checkpoint to finish right before issuing the next dcp.async_save, but shouldn't that wait come before the optimizer step that modifies the weights in-place? And shouldn't there be some kind of CUDA stream synchronization? Or is that hidden in the StorageWriter? Concretely, this is roughly what I would have expected (comments inline):
writer = StorageWriter(cache_staged_state_dict=True, path=CHECKPOINT_DIR)
checkpoint_future = None
for step in range(10):
    optimizer.zero_grad()
    model(torch.rand(8, 16, device="cuda")).sum().backward()
    # Shouldn't this be done here, before the optimizer step?
    if checkpoint_future is not None:
        # waits for checkpointing to finish, avoiding queuing more than one checkpoint request at a time
        checkpoint_future.result()
    optimizer.step()
    state_dict = { "app": AppState(model, optimizer) }
    # Missing this assignment I think
    checkpoint_future = dcp.async_save(state_dict, storage_writer=writer, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")

cleanup()
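
To make the CUDA-synchronization question concrete, here is the kind of thing I had in mind for the inner part of the loop. This is only a sketch of the question, not a proposed fix: the explicit torch.cuda.synchronize() is my assumption, and it may well be redundant if checkpoint_future.result() or the StorageWriter's staging already takes care of it.

    # Same wait placement as in the loop above, plus an explicit device sync.
    # (Assumption: this may be redundant if result() / the StorageWriter already
    # synchronizes the stream used to stage the previous checkpoint.)
    if checkpoint_future is not None:
        checkpoint_future.result()  # previous async checkpoint has finished
    torch.cuda.synchronize()        # make sure any pending staging copies are done
    optimizer.step()                # only now modify the weights in-place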
@lebrice Looking at the file again, I can see that line 266 already has the assignment checkpoint_future = dcp.async_save(...).