Muon-OGD¶
Implements Muon-OGD, a Muon-style spectral-norm update with orthogonal non-interference constraints for continual learning.
Muon-OGD adapts a model to a new task while protecting prior tasks. Each step is cast as a spectral-norm-constrained optimization: descend along the (momentum) gradient \(G_t\) under an operator-norm trust region \(\lVert \Delta \rVert_2 \le \eta\), subject to non-interference constraints \(\langle C_i, \Delta\rangle = 0\) that keep the update orthogonal to past-task directions \(C_i\). Unlike Frobenius-norm orthogonal projection, the spectral-norm geometry inherits Muon's operator-norm steepest descent.
The constrained primal has a dual in which one descends on the multipliers \(\lambda_i\) of the constraints; the shifted matrix \(H = G + \sum_i \lambda_i C_i\) is orthogonalized by the matrix sign function \(\mathrm{msgn}(\cdot) = UV^\top\) (the polar factor), approximated by five Newton-Schulz iterations \(\mathrm{NS5}\). The full update over an inner dual loop \(m = 0,\dots,T_{\mathrm{in}}-1\) is:
where \(\theta\) are the parameters, \(\eta\) the learning rate (operator-norm trust-region radius), \(g_t = \nabla_\theta \ell(\theta_t; \mathcal{S}_t)\) the gradient on mini-batch \(\mathcal{S}_t\), \(G_t\) the momentum matrix, \(\beta\) the momentum coefficient, \(C_i\) the past-task constraint directions, \(\lambda_i\) their dual multipliers, \(\eta_\lambda\) the dual step size, \(\lVert \cdot \rVert_2\) the spectral (induced matrix-2) norm, and \(\mathrm{NS5}\) five Newton-Schulz iterations of \(f(X) = \tfrac{3}{2}X - \tfrac{1}{2}XX^\top X\) approximating the matrix sign \(\mathrm{msgn}(X) = UV^\top\).
Reference: Binghang Lu, Zheyuan Deng, Runyu Zhang, Bing Hu, Yunhan Zhao, Yuan Tian, Changhong Mou, Guang Lin, Xiaomin Li, "Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning", arXiv 2026. https://arxiv.org/abs/2605.08949