Muon^2
Implements Muon\(^2\), Muon with Adam-style second-moment preconditioning applied before orthogonalization.
Muon orthogonalizes the momentum matrix via Newton-Schulz iterations, but its convergence is limited by the ill-conditioned spectrum of that momentum. Muon\(^2\) rescales the momentum element-wise by a running second moment of the gradient, sharpening the spectrum so that fewer Newton-Schulz iterations are needed to reach a sufficiently orthogonal update.
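The sensitivity of Newton-Schulz to conditioning can be illustrated with a small NumPy sketch. It assumes the simple cubic iteration \(X \leftarrow 1.5X - 0.5XX^\top X\) rather than whatever polynomial the actual implementation uses, and the helper names `newton_schulz` and `orthogonality_error` are hypothetical. With the same iteration budget, a well-conditioned input ends up close to orthogonal while an ill-conditioned one does not.

```python
import numpy as np

def newton_schulz(X, num_iters):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    Assumption: the plain cubic variant, chosen for clarity; it is not
    necessarily the polynomial used by Muon or Muon^2 in practice.
    """
    # Frobenius-normalize so every singular value lies in (0, 1],
    # which is inside the convergence region of the iteration.
    X = X / (np.linalg.norm(X) + 1e-7)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthogonality_error(X):
    """Largest deviation of a singular value from 1 (0 means orthogonal)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.max(np.abs(s - 1.0)))

# Build two 8x8 matrices with prescribed singular values.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))
well = U @ np.diag(np.linspace(0.5, 1.0, 8)) @ W.T   # condition number 2
ill = U @ np.diag(np.logspace(-3, 0, 8)) @ W.T       # condition number 1000

# Same budget of 5 iterations for both inputs.
err_well = orthogonality_error(newton_schulz(well, 5))
err_ill = orthogonality_error(newton_schulz(ill, 5))
print(f"well-conditioned error: {err_well:.3f}")
print(f"ill-conditioned error:  {err_ill:.3f}")
```

Small singular values grow by a factor of roughly 1.5 per cubic iteration, so a matrix whose smallest singular value is tiny needs many iterations before that direction approaches 1. This is the bottleneck that a flatter input spectrum is meant to relieve.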
For a parameter matrix \(\theta \in \mathbb{R}^{n \times m}\) with gradient \(G_t\), the update is expressed in terms of the following quantities: \(M_t\) is the momentum, \(V_t\) is the second-moment accumulator, \(\odot\) and \(\oslash\) are element-wise product and division, \(\sqrt{\cdot}\) acts element-wise, \(\mathrm{NewtonSchulz}(\cdot, K)\) applies \(K\) orthogonalization iterations, \(\eta\) is the learning rate, \(\sqrt{m/n}\) is a dimension-aware scaling factor, \(\beta_1, \beta_2\) are decay rates, and \(\epsilon\) is a stability constant.
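One way these quantities could fit together is sketched below in NumPy. This is a reading of the description, not the reference implementation: the exponential-moving-average form of \(M_t\) and \(V_t\), the placement of \(\epsilon\) outside the square root, the absence of bias correction, the cubic Newton-Schulz polynomial, and all default hyperparameter values are assumptions, and `muon2_step` is a hypothetical name.

```python
import numpy as np

def newton_schulz(X, num_iters):
    """Approximate orthogonalization (assumed cubic variant)."""
    X = X / (np.linalg.norm(X) + 1e-7)  # singular values into (0, 1]
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon2_step(theta, grad, M, V, lr=0.02, beta1=0.95, beta2=0.999,
               eps=1e-8, K=3):
    """One hypothetical Muon^2-style update for an (n, m) parameter matrix.

    All defaults are illustrative assumptions, not values from the paper.
    """
    n, m = theta.shape
    # Momentum and second-moment accumulators (assumed EMA form).
    M = beta1 * M + (1.0 - beta1) * grad
    V = beta2 * V + (1.0 - beta2) * grad * grad   # G_t ⊙ G_t
    # Element-wise preconditioning applied before orthogonalization.
    P = M / (np.sqrt(V) + eps)                    # M_t ⊘ (sqrt(V_t) + eps)
    O = newton_schulz(P, K)
    # Dimension-aware scaling sqrt(m / n) on the orthogonalized update.
    theta = theta - lr * np.sqrt(m / n) * O
    return theta, M, V

# Usage on a random 4 x 6 parameter matrix.
rng = np.random.default_rng(0)
lr = 0.02
theta = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
M = np.zeros_like(theta)
V = np.zeros_like(theta)
theta_new, M_new, V_new = muon2_step(theta, grad, M, V, lr=lr)
```

Because the iteration keeps all singular values at or below 1, the spectral norm of each step in this sketch is bounded by \(\eta \sqrt{m/n}\), independent of the gradient scale.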
Reference: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang, "Muon\(^2\): Boosting Muon via Adaptive Second-Moment Preconditioning", arXiv 2026. https://arxiv.org/abs/2604.09967