Jinjie Ni
Results: 1 issue by Jinjie Ni
It seems that in the current implementation, torch_dist checkpointing and loading introduce around 2 GB of GPU memory overhead on rank 0 (for a 400M model), which will cause OOM if...
stale