OptMuon
Implements OptMuon, a Muon-style orthogonalized momentum method with a closed-loop, self-normalized step-size schedule.
OptMuon keeps the Muon update structure, in which the search direction is the polar factor \(\mathrm{Orth}(M_t)\) of a momentum matrix \(M_t\), but replaces the fixed or open-loop magnitude rule with a trajectory-dependent, AdaGrad-Norm-style coefficient. A lagged self-normalized coefficient \(\alpha_t = \rho_{t-1}\) is built from the running gradient-norm history. Its numerator carries a running maximum that compensates for occasional gradient spikes, so the step does not collapse after a single large gradient. Direction and magnitude are thus separated: the polar factor sets the direction, while the scalar \(\theta\gamma_t\|M_t\|_F\) sets the magnitude.
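The split between direction and magnitude can be illustrated with a minimal NumPy sketch. It assumes a descent update of the form \(X_{t+1} = X_t - \theta\gamma_t\|M_t\|_F\,\mathrm{Orth}(M_t)\); the function names `orth` and `orthogonalized_step` are hypothetical, and \(\gamma_t\) is passed in by the caller because the closed-loop schedule that produces it is not reproduced here.

```python
import numpy as np


def orth(m):
    """Polar factor U W^T from the thin SVD of m, with Orth(0) = 0."""
    if not np.any(m):
        return np.zeros_like(m)
    u, _, wt = np.linalg.svd(m, full_matrices=False)
    return u @ wt


def orthogonalized_step(x, m, theta, gamma):
    """Apply one step (assumed form): x - theta * gamma * ||m||_F * Orth(m).

    gamma is supplied by the caller; this sketch does not compute the
    closed-loop schedule.
    """
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    magnitude = theta * gamma * np.linalg.norm(m)
    return x - magnitude * orth(m)
```

Rescaling `m` changes only `magnitude`; the direction `orth(m)` is unaffected by the scale of the momentum matrix.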
The framework has two variants sharing the same orthogonalized template. Option A (average smoothness, \(q=1/2\)) accumulates a single stochastic gradient per step; Option I (individual smoothness, \(q=2/3\)) uses a STORM-type recursive momentum estimator with two gradients on the same mini-batch. The polar factor is computed exactly via SVD in the analysis and approximated by a few Newton-Schulz iterations in practice.
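To make the "two gradients on the same mini-batch" requirement concrete, here is a sketch of the generic STORM recursion \(M_t = G_t + (1-a)(M_{t-1} - \tilde G_{t-1})\), where \(\tilde G_{t-1}\) is the gradient at the previous iterate evaluated on the current mini-batch. The toy loss, the helper names `minibatch_grad` and `storm_update`, and the mixing weight `a` are all assumptions for illustration; the exact coefficient OptMuon uses in Option I may differ.

```python
import numpy as np


def minibatch_grad(x, batch):
    """Gradient of the toy loss 0.5 * mean_i ||x - b_i||_F^2 over a mini-batch.

    batch has shape (B, rows, cols); the loss is a stand-in, not part of OptMuon.
    """
    return x - batch.mean(axis=0)


def storm_update(m_prev, x_curr, x_prev, batch, a):
    """Generic STORM-style momentum with placeholder mixing weight a in [0, 1]."""
    # Both gradients use the SAME mini-batch, at the current and previous iterate.
    grad_new = minibatch_grad(x_curr, batch)
    grad_old = minibatch_grad(x_prev, batch)
    # The correction term (m_prev - grad_old) carries over the previous estimate.
    return grad_new + (1.0 - a) * (m_prev - grad_old)
```

With `a = 1` the recursion reduces to the plain stochastic gradient, so only one gradient evaluation matters; for `a < 1` the second evaluation at the previous iterate is what distinguishes this estimator from a single-gradient scheme.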
Notation: \(X\) denotes the matrix parameters, \(\theta\) the learning rate, \(G_t\) the stochastic gradient with norm \(g_t\), \(M_t\) the momentum matrix, \(\alpha_t\) the lagged self-normalized coefficient, \(\gamma_t\) the closed-loop scalar step factor, \(q\) the smoothness-regime exponent (\(1/2\) for average smoothness, \(2/3\) for individual smoothness), and \(\mathrm{Orth}(M) = U W^\top\) the polar factor obtained from the thin SVD \(M = U \Sigma W^\top\) (with \(\mathrm{Orth}(0) = 0\)).
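The two ways of obtaining the polar factor can be compared in a short NumPy sketch. The SVD version follows the definition \(\mathrm{Orth}(M) = U W^\top\) directly. The iterative version uses the classical cubic Newton-Schulz iteration \(Y \leftarrow \tfrac{3}{2}Y - \tfrac{1}{2} Y Y^\top Y\) after Frobenius normalization; this particular polynomial, the default step count, and the function names are assumptions, and practical Muon-style implementations may use differently tuned coefficients.

```python
import numpy as np


def orth_svd(m):
    """Exact polar factor U W^T from the thin SVD, with Orth(0) = 0."""
    if not np.any(m):
        return np.zeros_like(m)
    u, _, wt = np.linalg.svd(m, full_matrices=False)
    return u @ wt


def orth_newton_schulz(m, steps=5, eps=1e-12):
    """Approximate polar factor via the classical cubic Newton-Schulz iteration."""
    norm = np.linalg.norm(m)  # Frobenius norm for a 2-D array
    if norm < eps:
        return np.zeros_like(m)
    # After normalization every singular value lies in (0, 1],
    # inside the convergence region of the cubic iteration.
    y = m / norm
    for _ in range(steps):
        # Each step pushes the nonzero singular values of y toward 1.
        y = 1.5 * y - 0.5 * y @ (y.T @ y)
    return y


m = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
exact = orth_svd(m)
few = orth_newton_schulz(m, steps=5)
many = orth_newton_schulz(m, steps=30)
```

For this well-conditioned matrix, `many` agrees with `exact` to high precision, while `few` is only an approximation. Matrices with very small singular values need more iterations, since the cubic map grows a small singular value by roughly a factor of 1.5 per step.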
Reference: Ganzhao Yuan, "OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality", arXiv 2026. https://arxiv.org/abs/2606.08783