Muon and AdaMuon: Orthogonal Optimizers for AdvOptim
This PR introduces Muon (added) and AdaMuon (~~to be added~~ added) as new experimental optimizers for adv-optm.
- Muon: An orthogonalizing optimizer that has demonstrated strong performance across diverse tasks. It is competitive with Adam while maintaining only a single momentum state (a rough sketch of the core update is shown below).
- AdaMuon: Builds on Muon by incorporating an element-wise second-moment estimator and sign-stabilized orthogonal updates, enabling adaptive scaling while preserving stable update geometry.
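For readers unfamiliar with Muon, here is a minimal sketch of the core update, not the code from this PR: it assumes the quintic Newton-Schulz coefficients from Keller Jordan's reference implementation, uses plain (non-Nesterov) momentum as a simplification, and omits the RMS/shape scaling discussed later in this thread. Function names are illustrative only.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix G (result ~ U @ V.T from its SVD)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G.float()
    transposed = X.size(0) > X.size(1)
    if transposed:                       # work in the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)            # ensure spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon step: momentum -> orthogonalize -> apply (param.ndim >= 2)."""
    momentum_buf.mul_(beta).add_(grad)                    # the single momentum state
    g2d = momentum_buf.reshape(momentum_buf.size(0), -1)  # flatten high-dim params to 2D
    update = newton_schulz_orthogonalize(g2d).reshape_as(param)
    param.add_(update, alpha=-lr)
```

As described above, AdaMuon additionally maintains an element-wise second-moment estimate on top of this update, which is why it carries one extra state buffer; that part is not shown in the sketch.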
### Notes
- ~~I implemented the reshaping from SMMF directly in Muon (controlled by `1D Vector Reshape`) to make it work for all dims and shapes. It should be fine since Muon already flattens high-dim params to 2D; reshaping 1D params for the orthogonal update should give higher-quality updates while ensuring that Muon won't fall back to plain SGD.~~
- Select a 5×–20× higher LR than Adam for Muon.
- For AuxAdam, select your usual Adam LR.
- Click the three dots near `MuonWithAuxAdam` to open the AdamW_Adv settings that will be used as the auxiliary optimizer with Muon.
- The `Simplified_AdEMAMix` method proved to work very well with Muon/AdaMuon in my tests, and it keeps Muon's LR range (no need to decrease the LR). It's now added as an option for both. You still need to tune beta1 depending on the training length.
### Usage
```
git fetch origin pull/1064/head:muon
git checkout muon
venv/scripts/activate
pip install adv-optm==1.2.dev14 --upgrade
```
### TODO
- [x] Compare Muon against Adam
- ~~Test `Factored=True` for Muon~~ Works
- [x] Implement MuonWithAuxAdam
- [x] Add AdaMuon
- [x] Compare AdaMuon against Muon
- ~~Test `Factored=True` for AdaMuon~~ Works
- ~~Maybe add as an option to `Prodigy_adv` to test if it works with Prodigy~~ (not working)
- [x] Enhance the auto preset for MuonWithAuxAdam and make it model-specific
Muon is going to need a layer filter: https://github.com/KellerJordan/Muon?tab=readme-ov-file#usage
But it might be worth it: https://github.com/Nerogar/OneTrainer/issues/868. Those tests were done with a manual layer filter, if I remember correctly.
Why layer filter, though? If it's because of 1D params, we already reshape them to 2D effectively by the SMMF method (when `1D Vector Reshape` = True).
> Why layer filter, though? If it's because of 1D params, we already reshape them to 2D effectively by the SMMF method (when `1D Vector Reshape` = True).
I cannot argue the theory, but you'll find somewhere in the original author's repo why they introduced the AuxAdam variant and why they think Muon may not work well even on some layers with 2D parameters.
I looked at their code:
```python
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)
# To replace the above, do the following:
from muon import MuonWithAuxAdam
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]
param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases + nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = MuonWithAuxAdam(param_groups)
```
And it seems my hypothesis is correct: they apply Muon only to parameters with ≥2 dims, while assigning most 1D parameters to Adam.
This is better than falling back to SGD, but I see our approach of reshaping as a better option than switching between optimizers, as it's proven to work well with SMMF factorization (and Muon does something similar for >2D parameters by flattening them, so we're just extending that idea).
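For illustration only (not the PR's actual implementation): the reshaping idea argued for here is to fold a 1D parameter into a near-square 2D matrix, SMMF-style, so the orthogonal update has a real matrix to act on instead of degenerating. The helper name `near_square_factors` is hypothetical.

```python
import math
import torch

def near_square_factors(n: int) -> tuple[int, int]:
    """Illustrative helper: find (r, c) with r * c == n and r as close to sqrt(n) as possible."""
    r = int(math.isqrt(n))
    while n % r != 0:
        r -= 1
    return r, n // r

def reshape_1d_for_orthogonal_update(v: torch.Tensor) -> torch.Tensor:
    """Reshape a 1D tensor into a near-square 2D matrix (SMMF-style) before the Muon update."""
    assert v.ndim == 1
    r, c = near_square_factors(v.numel())
    return v.reshape(r, c)

# Example: a bias vector of length 768 becomes a 24 x 32 matrix, which can be
# orthogonalized and then reshaped back to 1D afterwards.
bias_grad = torch.randn(768)
print(reshape_1d_for_orthogonal_update(bias_grad).shape)  # torch.Size([24, 32])
```

Note that for prime lengths this degenerates to a (1, n) matrix, so a real implementation would need padding or a fallback in that case.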
Take a second look at the code: they assign Adam to 1D parameters, and they also assign Adam to all the parameters that the user must set manually, named `nonhidden_params`.
I did something very similar in the K-B PR, but it ended up changing a lot of files and was eventually reverted.
It should be easy to implement again, though, with a fallback inside the same optimizer, and we can then run comparisons to see whether this approach is worth it or not.
Added MuonWithAuxAdam optimizer to TODO list
> I did something very similar in the K-B PR, but it ended up changing a lot of files and was eventually reverted. It should be easy to implement again, though, with a fallback inside the same optimizer, and we can then run comparisons to see whether this approach is worth it or not.
:thumbsup: It would be great if you could add the original Muon as well, not only Muon_Adv, so we have a comparison to the original.
If you do decide that a layer filter is worth it, have a look at the layer filter for training on the Training tab. It's pretty generic and can be re-used.
Edit: a layer filter that cannot be automated, but that the user has to choose.
> It would be great if you could add the original Muon as well, not only Muon_Adv, so we have a comparison to the original.
Yeah, I meant as an option within Muon_Adv, since it reduces to the original when you disable `1D Vector Reshape`.
> a layer filter that cannot be automated, but that the user has to choose
Why is that? I am thinking of adding an option to automate it, since the models we support are limited and it should be easy to implement. Also, most users won't bother looking at a model's architecture and extracting the names of its non-hidden layers, and those who do want to choose manually can disable the auto option and provide their own layer keys. I prefer this approach.
MuonWithAuxAdam is now available as an option for Muon_adv, if anyone wants to test Muon as proposed by its author.
It uses ADAMW_ADV (with a dedicated UI for it inside Muon) for non-hidden layers (either auto-detected or user-selected) and for 1D vectors, while the rest of the parameters are trained with MUON_ADV. A rough sketch of this kind of routing is shown below.
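This is not the PR's actual code or API, just a sketch of the routing described above; `non_hidden_keys`, the returned lists, and the substring matching are hypothetical stand-ins for whatever the real implementation uses.

```python
# Hypothetical sketch: route parameters either to Muon or to the auxiliary AdamW.
def split_param_groups(model, non_hidden_keys=("head", "embed")):
    muon_params, aux_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_non_hidden = any(key in name for key in non_hidden_keys)
        if p.ndim < 2 or is_non_hidden:
            aux_params.append(p)   # 1D vectors and non-hidden layers -> auxiliary AdamW
        else:
            muon_params.append(p)  # hidden >=2D weights -> Muon
    return muon_params, aux_params
```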
- [x] In PyTorch 2.9, Muon uses RMS scaling, which scales its updates to match Adam's learning-rate range. In this PR, we already have this feature, but it's only enabled for NorMuon/AdaMuon, so change this behaviour and add it as a new boolean flag, defaulting to `True`. (A sketch of this scaling is shown below.)
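For context, here is a minimal sketch of what such RMS scaling typically looks like. The `0.2 * sqrt(max(rows, cols))` factor is the convention used in several Muon implementations and write-ups; the exact factor in PyTorch 2.9 and in this PR may differ, and the flag name is hypothetical.

```python
import math
import torch

def rms_matched_update(ortho_update: torch.Tensor, rms_scaling: bool = True) -> torch.Tensor:
    """Illustrative RMS scaling for an orthogonalized 2D update.

    A semi-orthogonal m x n matrix has entry RMS ~ 1 / sqrt(max(m, n)); multiplying
    by 0.2 * sqrt(max(m, n)) brings the update's RMS close to the ~0.2 typical of
    Adam updates, so Adam-range learning rates can be reused.
    """
    if not rms_scaling:
        return ortho_update
    rows, cols = ortho_update.shape
    return ortho_update * (0.2 * math.sqrt(max(rows, cols)))
```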
> - [ ] In PyTorch 2.9, Muon uses RMS scaling, which scales its updates to match Adam's learning-rate range. In this PR, we already have this feature, but it's only enabled for NorMuon/AdaMuon, so change this behaviour and add it as a new boolean flag, defaulting to `True`.
Given that Dxq found regressions in features that shouldn't have regressed in 2.9.0, and that all the testing has been done on 2.8, there is almost no reason to upgrade. It's reasonable to assume they broke even more things in their attempt to completely remove Maxwell and Pascal support.
TL;DR: Please don't bother with 2.9.
> Given that Dxq found regressions in features that shouldn't have regressed in 2.9.0, and that all the testing has been done on 2.8, there is almost no reason to upgrade. It's reasonable to assume they broke even more things in their attempt to completely remove Maxwell and Pascal support.
> TL;DR: Please don't bother with 2.9.
Until they add MuonWithAuxAdam, I don't think PyTorch Muon is a good option (as a reference optimizer).
See comments
Reverted both changes to UIState and BaseConfig, and included the fix inside the Muon logic path.
I integrated the latest AdvOptm version in this PR. This version (1.2.8) includes:
- Cautious Weight Decay for all adv optimizers (a toggleable feature that I've seen getting significant adoption in ML repositories).
- Improved parameter update and weight decay for BF16 SR: the updates are now accumulated in float32 and rounded once at the end (see the sketch after this list).
- Fixed several bugs in the Cautious and Grams options. I remember @miasik was getting NaNs; this might have been the cause.
- Implemented the original multi-GPU Muon logic in the Muon adv variants. This should be faster and more accurate for multi-GPU runs, but I suspect it is incompatible with the fused backpass?
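To illustrate the BF16 SR change described above (not the library's actual code): do the whole step, weight decay included, in float32 and stochastically round to BF16 once at the end, instead of rounding after every intermediate operation. The function names are hypothetical; the bit trick is the standard one for bfloat16 stochastic rounding.

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Stochastically round a float32 tensor to bfloat16.

    bfloat16 keeps the top 16 bits of the float32 bit pattern; adding random noise
    to the lower 16 bits before truncating rounds up with probability proportional
    to the discarded fraction.
    """
    assert x_fp32.dtype == torch.float32
    bits = x_fp32.contiguous().view(torch.int32)
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=bits.device)
    rounded = (bits + noise) & ~0xFFFF            # truncate lower 16 bits after adding noise
    return rounded.view(torch.float32).to(torch.bfloat16)

@torch.no_grad()
def apply_update_bf16_sr(param_bf16, update_fp32, lr, weight_decay):
    """Accumulate decay + update in float32, then perform a single BF16 round."""
    p32 = param_bf16.float()
    p32 = p32 * (1.0 - lr * weight_decay) - lr * update_fp32   # all math in float32
    param_bf16.copy_(stochastic_round_to_bf16(p32))            # one rounding at the end
```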