GradCache
functional approach with distributed training
Thank you for the great work!
Could you please provide an example of using the functional approach with distributed multi-GPU training?
Hi @kevinlin311tw , sure, I can add an example in a day or two.
As a side note, the functional approach itself is agnostic to parallelism: you only need to wrap your encoder model and do the cross-process communication in the loss function. Maybe this comment will be helpful if you want to give it a try yourself.
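To illustrate the "communication in the loss function" part, here is a minimal sketch of a contrastive loss that all-gathers representations across processes. This is not GradCache code, just the usual pattern: `torch.distributed.all_gather` returns tensors detached from the autograd graph, so the local slice is swapped back in to keep its gradient path; the function and tensor names (`gather_with_grad`, `contrastive_loss`, `q_reps`, `p_reps`) are illustrative.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_grad(t: torch.Tensor) -> torch.Tensor:
    # all_gather returns tensors detached from the autograd graph,
    # so re-insert the local tensor to keep its gradient path alive
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t.contiguous())
    gathered[dist.get_rank()] = t
    return torch.cat(gathered, dim=0)

def contrastive_loss(q_reps: torch.Tensor, p_reps: torch.Tensor) -> torch.Tensor:
    # gather query/passage representations from every process so the
    # in-batch negatives come from all GPUs, then score them locally
    q = gather_with_grad(q_reps)
    p = gather_with_grad(p_reps)
    scores = q @ p.T
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)
```

The encoder itself stays unchanged; only this loss function needs to know about the other processes, which is why the functional approach composes with any parallelism setup.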
I've added an example in the readme, along with a new all-gather decorator that may be helpful.
Feel free to ping me if you have any questions or find any problems with the code.