Future plans for MXFP8 development

Open avizon-aws opened this issue 3 months ago • 7 comments

I searched for the list of planned features and development timelines for MXFP8 training but was unable to find one. It would be helpful to know the planned features and their timelines. For example, I tried out the MXFP8 all gather, but it is not currently implemented for MXTensors; it would be good to know when to expect it to be enabled.

avizon-aws avatar Nov 12 '25 10:11 avizon-aws

I believe https://github.com/pytorch/ao/issues/2147 has some details, but is likely not the full list

cc @danielvegamyhre @vkuzo

supriyar avatar Nov 12 '25 22:11 supriyar

For mxfp8 - at a high level, we are planning to:

  • continue improving training performance
  • merge the moe and dense codebases and bring them out of prototype, continuing with torchtitan as the showcase

If you are looking for something specific, let us know! Would be interested to hear about your use case and how we can help.

vkuzo avatar Nov 14 '25 11:11 vkuzo

When you say "merge the moe and dense codebases and bring them out of prototype", is that going to be a significant refactoring change? If there are changes will backward compatibility be ensured? Wondering if there is a timeline for this?

I was trying out the MXFP8 all gather, and it seems that it is not currently supported. That is one of the features I would like to use. Is there a timeline for this?

avizon-aws avatar Nov 18 '25 08:11 avizon-aws

> When you say "merge the moe and dense codebases and bring them out of prototype", is that going to be a significant refactoring change? If there are changes will backward compatibility be ensured? Wondering if there is a timeline for this?

We developed the dense APIs (prototype/mx_formats) separately from the MoE APIs (prototype/moe_training). Before we add BC guarantees, we want to unify them in a single place to match the rest of torchao. This will be a BC-breaking change, and the PR that makes this change will clearly spell out the "before" vs "after" APIs so that it will be easy for callsites to migrate. I'd estimate the timing to be 2025Q4 to 2026Q1.

vkuzo avatar Nov 18 '25 15:11 vkuzo

Thanks for your response. Is there a plan for enabling MXFP8 all gather?

avizon-aws avatar Nov 18 '25 17:11 avizon-aws

cc @danielvegamyhre

vkuzo avatar Nov 24 '25 14:11 vkuzo

@avizon-aws I created https://github.com/pytorch/ao/issues/3379 with the next steps I have in mind for mxfp8 MoE training. Please feel free to comment on the issue with any questions or suggestions!

danielvegamyhre avatar Nov 24 '25 15:11 danielvegamyhre