MARS-M
Implements MARS-M, a matrix-aware variance-reduced optimizer that brings MARS-style gradient correction to the Muon update.
MARS-M forms a corrected gradient \(c_t\) by adding to the current gradient a scaled difference between it and the gradient at the previous parameters, evaluated on the same minibatch, which reduces stochastic variance. The corrected gradient is clipped to unit norm, accumulated into a heavy-ball momentum matrix, and orthogonalized with a Newton-Schulz iteration before the parameter step with decoupled weight decay, so the update keeps the matrix structure of the layer, as in Muon.
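One way to write the steps described above, as a sketch in the notation of this page. The \(\beta / (1 - \beta)\) factor in the correction, the Frobenius norm in the clipping step, and the \((1 - \beta)\) weighting in the momentum are assumptions borrowed from common MARS formulations and are not confirmed by this page. Any shape-dependent rescaling of the orthogonalized update is omitted.

\[
\begin{aligned}
c_t &= g_t + \gamma_t \frac{\beta}{1 - \beta} \left( g_t - g_{t-1} \right) \\
\tilde{c}_t &= \frac{c_t}{\max\left(1, \lVert c_t \rVert_F\right)} \\
m_t &= \beta\, m_{t-1} + (1 - \beta)\, \tilde{c}_t \\
O_t &= \mathrm{NewtonSchulz}(m_t) \\
\theta_{t+1} &= \theta_t - \eta_t \left( O_t + \lambda\, \theta_t \right)
\end{aligned}
\]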
Here \(\theta\) denotes the matrix parameters of shape \(m \times n\); \(g_t = \nabla f(\theta_t, \xi_t)\) and \(g_{t-1} = \nabla f(\theta_{t-1}, \xi_t)\) are gradients computed on the same minibatch \(\xi_t\); \(\gamma_t\) is the variance-reduction scaling; \(\beta\) is the momentum coefficient; \(\mathrm{NewtonSchulz}(\cdot)\) approximates the orthogonal factor \(U V^\top\) of the singular value decomposition \(m_t = U \Sigma V^\top\); \(\eta_t\) is the learning rate; and \(\lambda\) is the decoupled weight decay coefficient.
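The steps above can be sketched in NumPy for a single matrix parameter. This is an illustrative sketch under stated assumptions, not the library's implementation. The function names `newton_schulz` and `mars_m_step`, the hyperparameter defaults, the \(\beta / (1 - \beta)\) correction factor, and the \((1 - \beta)\) momentum weighting are all hypothetical choices. For simplicity it uses the classic cubic Newton-Schulz iteration in place of the tuned polynomial used in Muon, and it applies no shape-dependent rescaling to the update.

```python
import numpy as np


def newton_schulz(mat, steps=10, eps=1e-7):
    """Approximate U V^T for mat = U S V^T with the cubic Newton-Schulz iteration."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = mat / (np.linalg.norm(mat) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work with the wide orientation so x @ x.T is the smaller product
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def mars_m_step(theta, m_prev, g, g_prev, lr=0.02, beta=0.95, gamma=0.025,
                weight_decay=0.0, ns_steps=10):
    """One illustrative MARS-M style update for a single matrix parameter.

    g      : gradient at the current parameters on the current minibatch
    g_prev : gradient at the previous parameters on the same minibatch
    """
    # Corrected gradient (assumed beta / (1 - beta) scaling of the difference).
    c = g + gamma * (beta / (1.0 - beta)) * (g - g_prev)
    # Clip to at most unit Frobenius norm.
    c = c / max(1.0, np.linalg.norm(c))
    # Momentum accumulation (assumed (1 - beta) weighting).
    m = beta * m_prev + (1.0 - beta) * c
    # Approximate orthogonalization of the momentum matrix.
    o = newton_schulz(m, ns_steps)
    # Parameter step with decoupled weight decay.
    theta_new = theta - lr * (o + weight_decay * theta)
    return theta_new, m
```

A caller would compute `g` and `g_prev` from the same minibatch, at the current and previous parameters respectively, and carry `m` from one step to the next. Because the orthogonalized update has singular values close to one, its size depends mostly on `lr` rather than on the magnitude of the gradient.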
Reference: Yifeng Liu, Angela Yuan, Quanquan Gu, "MARS-M: When Variance Reduction Meets Matrices", 2025. https://arxiv.org/abs/2510.21800