Does torchrun + FSDP create multiple copies of the same dataset and model?
In the example T5 training code, the main function constructs the model and the dataset on every worker rank before passing the model to FSDP. Does this mean that there are n copies of the model and dataset when running the script with torchrun and n processes? A minimal sketch of the pattern I mean is below.
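This sketch uses a stand-in linear model and random dataset rather than the actual T5 example, but the structure is the same as I understand it: every process launched by torchrun runs the same construction code and only then wraps the model in FSDP.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every one of the n processes executes these lines, so each process
    # materializes its own full copy of the dataset and the model before
    # FSDP shards the parameters across ranks.
    dataset = TensorDataset(torch.randn(1024, 16))   # stand-in for the real dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    model = torch.nn.Linear(16, 16)                  # stand-in for the T5 model
    model = FSDP(model.cuda())                       # sharding only happens here

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```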
My code is set up similarly to the T5 example code, and the memory consumption per GPU is the same regardless of how many torchrun processes I use, so it does seem like I am creating n copies of the model. How can I avoid this?