Compiled Optimizers: Accelerate all Advanced Optimizers with pre-compilation
This pull request introduces a new boolean option, **Compiled Optimizer**, to all advanced optimizers, allowing the core update logic to be compiled with `torch.compile` (tested on PyTorch 2.8).
By using `torch.compile`, we can fuse operations and optimize the computational graph, resulting in significant performance improvements in high-throughput or heavily parallel environments.
Includes: #1020 and #1064
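To make the idea concrete, here is a minimal, hypothetical sketch of what "compiling the core update logic" means. It is not the code from this PR; `adam_update` and its parameters are illustrative names only, and the math shown is just a plain Adam step.

```python
import torch

# Hypothetical sketch (not this PR's code): wrapping the per-parameter update
# math in torch.compile lets the inductor backend fuse the chain of elementwise
# ops into far fewer kernel launches.
@torch.compile
def adam_update(param, grad, exp_avg, exp_avg_sq, step, lr, beta1, beta2, eps):
    # Standard Adam update; every tensor op below is a fusion candidate.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / bias_correction1)

# Example call with dummy tensors (runs on CPU; CUDA benefits the most):
p = torch.randn(1024, 1024)
g = torch.randn_like(p)
m, v = torch.zeros_like(p), torch.zeros_like(p)
adam_update(p, g, m, v, step=1, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
```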
When to use:

- Features that add noticeable overhead on the optimizer side; with `torch.compile`, their overhead becomes negligible:
  - OrthoGrad: introduces 33% overhead at small batch sizes.
  - 1-bit Factored mode: also introduces some overhead.
  - 3-state optimizers like AdEMAMix: more states mean more optimizer calculations.
- Full fine-tuning: larger models may spend more time in optimizer-side calculations.
- Orthogonal optimizers: Muon and AdaMuon have noticeable overhead in their orthogonalization ops; `torch.compile` should reduce it (a rough sketch follows after this list).
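As an illustration of the Muon/AdaMuon point above, here is a hedged sketch of a Newton-Schulz orthogonalization loop wrapped in `torch.compile`. The coefficients follow the publicly available Muon reference implementation; `orthogonalize` is an illustrative name, not a function from this repo.

```python
import torch

# Illustrative only: a Newton-Schulz orthogonalization loop in the style used by
# Muon. Compiling it lets the matmuls and elementwise ops be scheduled as one
# graph instead of many small kernel launches.
@torch.compile
def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)  # iterate on the smaller Gram matrix
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X

# Example: turn a gradient matrix into an approximately orthogonal update direction.
W_grad = torch.randn(512, 2048)
O = orthogonalize(W_grad)
```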
## Usage

- `git fetch origin pull/1083/head:compile_optm`
- `git checkout compile_optm`
- Run `install.bat` or `update.bat`
## TODO
- [ ] Ensure backward compatibility with older backups.
## Known Issues
Thanks to @dxqb for initial support and helpful insights!