Koratahiu
> Thanks! I don't have an HV transformer file, so I'm going to merge without testing. Please double-check the PR with this in mind. I tested it on embedding training...
Why the layer filter, though? If it’s because of 1D params, we already reshape them to 2D effectively via the SMMF method (when `1D Vector Reshape` = True); see the sketch below.
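For illustration, a minimal sketch of what that 1D-to-2D reshaping could look like. The function name and the nearest-square folding strategy are assumptions here, not the PR's actual implementation:

```python
import math

import torch


def reshape_1d_to_2d(p: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: fold a 1D parameter into the most nearly
    square 2D matrix with the same number of elements."""
    n = p.numel()
    rows = int(math.isqrt(n))
    while n % rows != 0:  # largest divisor of n that is <= sqrt(n)
        rows -= 1
    return p.view(rows, n // rows)


# Example: a 768-element bias vector becomes a 24x32 matrix, so
# matrix-shaped update rules (e.g. Muon's orthogonalization) apply.
bias = torch.randn(768)
print(reshape_1d_to_2d(bias).shape)  # torch.Size([24, 32])
```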
I looked at their code:

```python
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)
# To replace the above, do the following:
from muon import MuonWithAuxAdam
hidden_weights = [p for...
```
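For context, the snippet from the Muon repository's README continues roughly as below. This is reproduced from memory, so treat the exact attribute names (`model.body`, `model.head`, `model.embed`) and hyperparameter values as assumptions:

```python
from muon import MuonWithAuxAdam

# Hidden 2D weights get Muon; everything else gets the auxiliary AdamW.
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]
param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases + nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = MuonWithAuxAdam(param_groups)
```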
I did something very similar in the K-B PR, but it ended up changing a lot of files and was eventually reverted. It should be easy to implement again, though,...
Added MuonWithAuxAdam optimizer to TODO list
> would be great if you could add the original Muon as well, not only Muon_Adv, so we have a comparison to the original

Yeah, I meant as an option...
`MuonWithAuxAdam` is now available as an option for `Muon_adv`, for anyone who wants to test Muon as proposed by its author. It uses `ADAMW_ADV` (special UI for it inside...
- [x] In [PyTorch 2.9, Muon](https://docs.pytorch.org/docs/stable/generated/torch.optim.Muon.html#torch.optim.Muon) uses RMS scaling, which scales its updates to match Adam’s learning-rate range. This PR already has that feature, but it’s only...
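To make the RMS-matching idea concrete, here is a minimal sketch under the heuristic commonly cited in the Muon literature (an assumption here, not necessarily this PR's or PyTorch's exact code): an orthogonalized update of an m×n matrix has entry RMS near 1/sqrt(max(m, n)), so scaling by 0.2·sqrt(max(m, n)) brings it into AdamW's typical ~0.2 RMS range and lets both optimizers share one learning rate:

```python
import torch


def rms_matched_step(weight: torch.Tensor, update: torch.Tensor, lr: float) -> None:
    """Hedged sketch: apply an already-orthogonalized Muon update, scaled so
    its RMS roughly matches an Adam-style update (assumed heuristic)."""
    fan_out, fan_in = update.shape[-2], update.shape[-1]
    scale = 0.2 * max(fan_out, fan_in) ** 0.5  # target entry RMS ~= 0.2
    weight.add_(update, alpha=-lr * scale)
```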
> Given Dxq found regressions in features that shouldn't have regressed in 2.9.0, and that all the testing has been done on 2.8, there is almost no reason to upgrade....
> See comments

Reverted both the UIState and BaseConfig changes, and included the fix inside the Muon logic path.