Muon and AdaMuon: Orthogonal Optimizers for AdvOptim

[Open] Koratahiu opened this issue 3 months ago • 13 comments

This PR introduces Muon (added) and AdaMuon (~~to be added~~ added) as new experimental optimizers for adv-optm.

  1. Muon: An orthogonalizing optimizer that has demonstrated strong performance across diverse tasks. It is competitive with Adam while maintaining only a single momentum state (a sketch of the core orthogonalization idea follows after this list).

  2. AdaMuon: Builds on Muon by incorporating an element-wise second-moment estimator and sign-stabilized orthogonal updates, enabling adaptive scaling while preserving stable update geometry.
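For context, the core trick in Muon is to replace the raw momentum update of a 2D weight with an (approximately) orthogonalized version of it, usually via a few Newton-Schulz iterations. A minimal sketch of that idea, using the coefficients and step count from the commonly published reference Muon (illustrative only, not the exact adv-optm code):

import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximate the nearest semi-orthogonal matrix to G with a quintic
    # Newton-Schulz iteration (coefficients from the reference Muon).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

# Muon-style step for a single 2D weight W (sketch):
#   M = beta * M + grad            # single momentum buffer
#   W -= lr * newton_schulz_orthogonalize(M)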


Notes

  • ~~I implemented the reshaping from SMMF directly in Muon (controlled by 1D Vector Reshape) to make it work for all dims and shapes. It should be fine since Muon already flattens high-dim params to 2D; reshaping 1D params for the orthogonal update should result in higher quality while ensuring that Muon won't fall back to plain SGD.~~
  • Select an LR 5×–20× higher than your usual Adam LR for Muon.
  • For AuxAdam, select your usual Adam LR.
  • Click the three dots near MuonWithAuxAdam to open AdamW_Adv settings that will be used as auxiliary optimizer with Muon.
  • The Simplified_AdEMAMix method proved to work very well with Muon/AdaMuon in my tests, and it keeps Muon's LR range (no need to decrease the LR). It's now added as an option for both; you still need to tune beta1 depending on the training length.

Usage

git fetch origin pull/1064/head:muon
git checkout muon
venv/scripts/activate
pip install adv-optm==1.2.dev14 --upgrade

TODO

  • [x] Compare Muon against Adam
  • ~~Test Factored=True for Muon~~ Works
  • [x] Implement MuonWithAuxAdam
  • [x] Add AdaMuon
  • [x] Compare AdaMuon against Muon
  • ~~Test Factored=True for AdaMuon~~ Works
  • ~~Maybe add as an option to Prodigy_adv to test if it works with Prodigy~~ (not working).
  • [x] Enhance the auto preset for MuonWithAuxAdam and make it model-specific

Koratahiu avatar Oct 17 '25 16:10 Koratahiu

Muon is going to need a layer filter: https://github.com/KellerJordan/Muon?tab=readme-ov-file#usage

But it might be worth it: https://github.com/Nerogar/OneTrainer/issues/868. Those tests were done with a manual layer filter, if I remember correctly.

dxqb avatar Oct 17 '25 16:10 dxqb

Why a layer filter, though?
If it's because of 1D params, we already reshape them to 2D effectively via the SMMF method (when 1D Vector Reshape = True).

Koratahiu avatar Oct 17 '25 16:10 Koratahiu

Why a layer filter, though? If it's because of 1D params, we already reshape them to 2D effectively via the SMMF method (when 1D Vector Reshape = True).

I cannot argue the theory, but you'll find somewhere in the original author's repo why they introduced the AuxAdam variant and why they think Muon does not work well even on some layers with 2D parameters.

dxqb avatar Oct 17 '25 16:10 dxqb

I looked at their code

# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)

# To replace the above, do the following:

from muon import MuonWithAuxAdam
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]
param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases+nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = MuonWithAuxAdam(param_groups)

And it seems my hypothesis is correct: they route only ≥2D parameters to Muon, while assigning most 1D parameters to Adam. This is better than falling back to SGD, but I see our reshaping approach as a better option than switching between optimizers, as it has proven to work well with SMMF factorization (and Muon already does something similar for >2D parameters by flattening them, so we're just extending that idea).
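To make the comparison concrete, the reshape-instead-of-fallback idea just folds a 1D vector into a roughly square 2D matrix before the orthogonal update and flattens it back afterwards. A hypothetical sketch (the factor search below is my own illustration, not the SMMF or adv-optm code):

import math
import torch

def fold_1d_to_2d(v: torch.Tensor) -> torch.Tensor:
    # Fold a 1D parameter/update into the most square 2D shape that divides
    # its length, so the orthogonal update can be applied instead of
    # falling back to plain SGD.
    n = v.numel()
    rows = math.isqrt(n)
    while n % rows != 0:
        rows -= 1
    return v.view(rows, n // rows)

# Usage sketch: fold, orthogonalize, flatten back.
#   m2d = fold_1d_to_2d(momentum_1d)
#   update_1d = newton_schulz_orthogonalize(m2d).flatten()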

Koratahiu avatar Oct 17 '25 17:10 Koratahiu

Take a second look at the code: they assign Adam to 1D parameters, and they also assign Adam to a set of parameters that the user must specify, named nonhidden_params.

dxqb avatar Oct 17 '25 17:10 dxqb

I did something very similar in the K-B PR, but it ended up changing a lot of files and was eventually reverted.
It should be easy to implement again, though, with a fallback inside the same optimizer, and we can then run comparisons to see whether this approach is worth it or not.

Koratahiu avatar Oct 17 '25 17:10 Koratahiu

Added the MuonWithAuxAdam optimizer to the TODO list.

Koratahiu avatar Oct 17 '25 17:10 Koratahiu

I did something very similar in the K-B PR, but it ended up changing a lot of files and was eventually reverted. It should be easy to implement again, though, with a fallback inside the same optimizer, and we can then run comparisons to see whether this approach is worth it or not.

:thumbsup: It would be great if you could add the original Muon as well, not only Muon_Adv, so we have a comparison to the original.

If you do decide that a layer filter is worth it, have a look at the layer filter for training on the Training tab. It's pretty generic and can be re-used.

Edit: a layer filter that cannot be automated; the user has to choose the layers themselves.

dxqb avatar Oct 17 '25 17:10 dxqb

It would be great if you could add the original Muon as well, not only Muon_Adv, so we have a comparison to the original.

Yeah, I meant as an option of Muon_Adv: it reduces to the original Muon when you disable 1D Vector Reshape.

a layer filter that cannot be automated but the user has to choose

Why is that? I'm thinking of adding an option to automate it, since the models we support are limited and it should be easy to implement. Also, most users won't bother looking at the model architecture and extracting the names of the non-hidden layers, and those who want to choose manually can disable the automation and provide their own layer keys. I prefer this approach.
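To make "auto or manual layer keys" concrete, the routing I have in mind is just a name- and shape-based split of the parameters, roughly like this (the key list and function name are hypothetical, not the actual OneTrainer preset):

import torch

# Hypothetical model-specific keys for non-hidden layers.
NON_HIDDEN_KEYS = ("embed", "head", "norm")

def split_param_groups(model: torch.nn.Module, keys=NON_HIDDEN_KEYS):
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1D tensors and non-hidden layers (auto preset or user keys) go to the aux Adam.
        if p.ndim < 2 or any(k in name for k in keys):
            adam_params.append(p)
        else:
            muon_params.append(p)
    return muon_params, adam_params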

Koratahiu avatar Oct 18 '25 00:10 Koratahiu

MuonWithAuxAdam is now available as an option for Muon_adv, if anyone wants to test Muon as proposed by its author.
It uses ADAMW_ADV (with its own UI inside Muon) for non-hidden layers (either auto-detected or user-selected) and for 1D vectors, while the rest of the training uses MUON_ADV.


Koratahiu avatar Oct 18 '25 11:10 Koratahiu

  • [x] In PyTorch 2.9, Muon uses RMS scaling, which scales its updates to match Adam’s learning rate range.
    In this PR, we already have this feature, but it’s only enabled for NorMuon/AdaMuon, so change this behaviour and add it as a new boolean flag, defaulting to True.
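For reference, "RMS scaling" here means rescaling the orthogonalized update so its root-mean-square magnitude lands near what Adam updates typically have, which is what lets Muon reuse Adam-like LRs. A minimal sketch of one common rule (the 0.2 target and the exact formula are illustrative assumptions, not the adv-optm or PyTorch internals):

import math
import torch

def rms_match(orth_update: torch.Tensor, target_rms: float = 0.2) -> torch.Tensor:
    # An orthogonalized (m, n) update has RMS ~= 1 / sqrt(max(m, n)),
    # so scaling by target_rms * sqrt(max(m, n)) brings its RMS close to
    # target_rms, i.e. roughly the magnitude of a typical Adam update.
    m, n = orth_update.shape[-2], orth_update.shape[-1]
    return orth_update * (target_rms * math.sqrt(max(m, n)))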

Koratahiu avatar Nov 07 '25 05:11 Koratahiu

  • [ ] In PyTorch 2.9, Muon uses RMS scaling, which scales its updates to match Adam’s learning rate range. In this PR, we already have this feature, but it’s only enabled for NorMuon/AdaMuon, so change this behaviour and add it as a new boolean flag, defaulting to True.

Given Dxq found regressions in features that shouldn't have regressed in 2.9.0, and that all the testing has been done on 2.8, there is almost no reason to upgrade. It's reasonable to assume they've broken even more things in their attempt to completely remove Maxwell and Pascal support.

TL;DR: please don't bother with 2.9.

O-J1 avatar Nov 07 '25 07:11 O-J1

Given Dxq found regressions in features that shouldn't have regressed in 2.9.0, and that all the testing has been done on 2.8, there is almost no reason to upgrade. It's reasonable to assume they've broken even more things in their attempt to completely remove Maxwell and Pascal support.

TL;DR: please don't bother with 2.9.

Until they add MuonWithAuxAdam, I don't think PyTorch Muon is a good option (as a reference optimizer).

Koratahiu avatar Nov 07 '25 20:11 Koratahiu

See comments

Reverted both changes to UIState and BaseConfig, and included the fix inside the Muon logic path.

Koratahiu avatar Nov 22 '25 08:11 Koratahiu

I integrated the latest AdvOptm version in this PR. This version (1.2.8) includes:

  • Cautious Weight Decay for all adv optimizers (a toggle feature that I've seen gaining significant adoption in ML repositories).
  • Improved parameter updates and weight decay for BF16 SR: the updates are now accumulated in float32 and rounded once at the end (see the sketch after this list).
  • Fixed several bugs in the Cautious and Grams options. I remember @miasik was getting NaNs; this might have been the cause.
  • Implemented the original Muon multi-GPU logic in the Muon adv variants. It should be faster and more accurate for multi-GPU runs, but I suspect it is incompatible with the fused back pass?
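As an aside, the BF16 SR change in the second bullet boils down to doing the update arithmetic (including weight decay) in float32 and stochastically rounding back to bfloat16 once at the end, instead of rounding after every intermediate step. A rough sketch, with my own illustrative rounding helper (not the adv-optm implementation):

import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    # Round float32 to bfloat16 stochastically: add random noise to the
    # 16 mantissa bits that bf16 drops, then truncate them.
    assert x.dtype == torch.float32
    bits = x.contiguous().view(torch.int32)
    noise = torch.randint_like(bits, 0, 1 << 16)
    return ((bits + noise) & -65536).view(torch.float32).to(torch.bfloat16)

def apply_update_bf16_sr(param_bf16, update_fp32, lr, weight_decay):
    # Accumulate everything in float32, round once at the end.
    w = param_bf16.float()
    w = w * (1.0 - lr * weight_decay) - lr * update_fp32
    param_bf16.copy_(stochastic_round_to_bf16(w))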

Koratahiu avatar Nov 29 '25 09:11 Koratahiu