SF-NorMuon¶
Implements SF-NorMuon, a schedule-free variant of NorMuon that needs no decay schedule and yields a well-trained model at every step.
SF-NorMuon fuses three ingredients. From Muon it keeps the polar transform: the momentum buffer is orthogonalized via Newton-Schulz so the update is the steepest-descent direction under the spectral norm. From NorMuon it adds row-wise normalization: a running second moment of the squared polar update is tracked per row and used to rescale it, much like Adam's RMS normalization. From schedule-free optimization it borrows two coupled sequences — a fast iterate \(z_t\) that moves at a constant learning rate, and an averaged iterate \(x_t\) (the output, where loss is evaluated) formed by an online weighted average. Gradients are queried at an interpolation \(y_t\) of the two.
Because the averaging weight \(c_{t+1}=\eta_t^2/s_t\) is driven by the cumulative sum of squared learning rates, no learning-rate decay schedule is required; the method is "anytime," producing a deployable model at any stopping point.
where \(z_t\) is the fast iterate, \(x_t\) the averaged (output) iterate, \(y_t\) the interpolation point at which the gradient \(g_t\) is taken, \(m_t\) the momentum buffer with decay \(\mu\), \(\mathrm{polar}(\cdot)\) the orthogonal polar factor \(UV^\top\) of the SVD (computed by Newton-Schulz), \(\mathrm{mean_{cols}}\) the per-row mean across columns and \(\mathrm{ExpandRows}\) its broadcast back to the \(m\times n\) matrix shape, \(v_t\) the row-wise second moment with decay \(\beta_2\), \(\hat{P}_t\) the row-normalized update, \(\beta_1\) the interpolation weight, \(\eta\) the base learning rate with warmup over \(T_{\mathrm{warmup}}\) steps, \(\hat{\eta}_t\) the RMS-matched effective step, \(\lambda\) the weight decay, \(s_t\) the cumulative sum of squared learning rates, \(c_{t+1}\) the schedule-free averaging coefficient, \(\odot\) and \(\oslash\) element-wise product and division, and \(\epsilon\) a stability constant.
Reference: Anuj Apte, Pranav Deshpande, Niraj Kumar, Shouvanik Chakrabarti, Junhyung Lyle Kim, "Anytime Training with Schedule-Free Spectral Optimization", arXiv 2026. https://arxiv.org/abs/2605.23061