Matej Sirovatka
Managed to get a minimal repro of composable TP+FSDP2 working, though it requires nightly torch 🚀
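For reference, composing the two in raw PyTorch looks roughly like the sketch below. This is not the actual repro, just an illustrative toy module; the mesh sizes and layer names are made up, and the `fully_shard` import path differs between torch versions (it lived under `torch.distributed._composable.fsdp` on older nightlies).

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # older nightlies: torch.distributed._composable.fsdp
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    """Illustrative stand-in for a transformer MLP block."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim, bias=False)
        self.down = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


def build_parallel_model(dp_size: int, tp_size: int) -> nn.Module:
    # Assumes the process group was already initialized (e.g. via torchrun).
    # 2D mesh: outer dim for FSDP2 data-parallel sharding, inner dim for TP.
    mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

    model = ToyMLP().cuda()

    # 1) Apply TP on the "tp" sub-mesh: shard `up` column-wise, `down` row-wise.
    parallelize_module(
        model,
        mesh["tp"],
        {"up": ColwiseParallel(), "down": RowwiseParallel()},
    )

    # 2) Apply FSDP2 on the "dp" sub-mesh, on top of the TP-sharded parameters.
    fully_shard(model, mesh=mesh["dp"])
    return model
```

The ordering is the important part: TP is applied on the inner mesh dimension first, then FSDP2 shards the resulting DTensor parameters across the outer (data-parallel) dimension.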
@kmehant we have internally decided to have as much logic as possible in transformers, so this is postponed until that is ~resolved. You can watch that [here](https://github.com/huggingface/transformers/pull/37877)
This might get a bit messy, but I started playing around and had some success. At [2a13375](https://github.com/huggingface/accelerate/pull/3498/commits/2a13375c577c309fa1ca0f4f37bc2e76033e5261) we have a working FSDP2+TP example, gonna try to clean this up a bit...
Also superseded by #3682
In [accelerate](https://github.com/huggingface/accelerate) we have integration with both AO and TE, where AO should soon work with FSDP2. Is there anyone tackling the integration of TE? I would be limited to...
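For context, standalone TE FP8 usage (outside of accelerate) looks roughly like the sketch below; the recipe values and shapes are illustrative, and FP8 execution requires supported hardware (Hopper or newer).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Illustrative FP8 recipe: hybrid E4M3/E5M2 formats with delayed scaling.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max"
)

# TE drop-in linear layer; params kept in bf16, GEMMs run in FP8 inside the autocast.
layer = te.Linear(1024, 1024, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)
```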
@stas00 I think given my limited availability recently, and how long it will take me to get to doing it in DeepSpeed, you can probably just integrate it with DeepSpeed as...
Will do some more research on this; if anyone has any insights on what could/should be implemented, or details on how, cc me.
Maybe a preliminary step would be to support e.g. mixtral/nllb_moe from huggingface, so the integration is ready when the layers are done?
@yundai424 Haven't seen one either; gonna try patching either Mixtral or NLLB with our kernels and profiling it, and will decide what to do after that I guess. Implementing dMoE...
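To make "patching the MLP and profiling it" concrete: a naive reference top-k routed MoE MLP (roughly the Mixtral-style block; names, shapes, and activation are illustrative) that a kernel-backed version would be compared against could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoEMLP(nn.Module):
    """Naive top-k routed MoE MLP with a Python loop over experts.
    Useful only as a correctness/performance baseline for fused kernels."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                  # (num_tokens, dim)
        weights = F.softmax(self.router(tokens), dim=-1)     # (num_tokens, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize routing weights

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Tokens whose top-k choices include this expert.
            token_ids, slot = (topk_idx == expert_id).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += topk_w[token_ids, slot, None] * expert(tokens[token_ids])
        return out.reshape_as(x)
```

The per-expert Python loop is exactly the part a fused/grouped-GEMM dMoE kernel would replace, so it gives a clear baseline for profiling.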
@pramodith I totally agree with starting with the MLP; however, I'm currently surprisingly swamped with school, so I won't have time to collaborate on this. So feel free to take...