Future plans for MXFP8 development
I tried searching for the list of planned features and timelines for MXFP8 training, but was unable to find one. It would be helpful to know about the planned features and their timelines. For example, I tried out the MXFP8 all-gather, but it is not currently implemented for MXTensors; it would be good to know when to expect it to be enabled.
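For context, a minimal sketch of roughly what I attempted is below (the `MXTensor.to_mx` call is from my reading of `torchao/prototype/mx_formats` and may not be exact):

```python
# run with: torchrun --nproc-per-node=2 mx_allgather_repro.py
import os
import torch
import torch.distributed as dist
from torchao.prototype.mx_formats.mx_tensor import MXTensor

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Build an mxfp8 tensor (e4m3 elements, block size 32 per the MX spec)
x = torch.randn(128, 256, device="cuda", dtype=torch.bfloat16)
mx = MXTensor.to_mx(x, torch.float8_e4m3fn, block_size=32)

# This is the step that does not work for me today: the all-gather collective
# does not appear to be implemented for the MXTensor subclass.
out = torch.empty(dist.get_world_size() * 128, 256, device="cuda", dtype=torch.bfloat16)
dist.all_gather_into_tensor(out, mx)
```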
I believe https://github.com/pytorch/ao/issues/2147 has some details, but is likely not the full list
cc @danielvegamyhre @vkuzo
For mxfp8 - at a high level, we are planning to:
- continue improving training performance
- merge the moe and dense codebases and bring them out of prototype, continuing with torchtitan as the showcase
If you are looking for something specific, let us know! Would be interested to hear about your use case and how we can help.
When you say "merge the moe and dense codebases and bring them out of prototype", is that going to be a significant refactoring change? If there are changes, will backward compatibility be ensured? Wondering if there is a timeline for this?
I was trying out the MXFP8 all-gather and it seems that is not currently supported; that would be one of the features I would like to use. Is there a timeline for this?
> When you say "merge the moe and dense codebases and bring them out of prototype", is that going to be a significant refactoring change? If there are changes, will backward compatibility be ensured? Wondering if there is a timeline for this?
We developed the dense APIs (prototype/mx_formats) separately from the MoE APIs (prototype/moe_training). Before we add BC guarantees, we want to unify them in a single place to match the rest of torchao. This will be a BC-breaking change, and the PR that makes this change will clearly spell out the "before" vs "after" APIs, so it will be easy for callsites to migrate. I'd estimate timing to be 2025Q4 to 2026Q1.
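To give a feel for the shape of the change, a purely hypothetical before/after sketch is below; none of these module paths or names are final, and the migration PR will document the real ones:

```python
# Hypothetical illustration only -- none of these paths or names are committed;
# the migration PR will spell out the real "before" vs "after" APIs.

# Before: dense and MoE MX training are configured from two separate prototype namespaces
from torchao.prototype.mx_formats import MXLinearConfig   # dense (current prototype)
# from torchao.prototype.moe_training import ...          # MoE (current prototype)

# After (illustrative): a single unified, non-prototype entry point covering both,
# e.g. something shaped like
# from torchao.mx_formats import MXTrainingConfig
# quantize_(model, MXTrainingConfig(...))
```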
Thanks for your response. Wondering if there is a plan for enabling the MXFP8 all-gather?
cc @danielvegamyhre
@avizon-aws I created https://github.com/pytorch/ao/issues/3379 with the next steps I have in mind for mxfp8 MoE training. Please feel free to comment on the issue with any questions or suggestions!