Jinjie Ni

Results: 1 issue by Jinjie Ni

It seems that in the current implementation, torch_dist checkpointing and loading introduce around 2 GB of GPU memory overhead on rank 0 (for a 400M model), which will cause OOM if...

stale