MUD¶
Implements MUD (MomentUm Decorrelation), a triangular whitening surrogate for Muon's polar update on matrix-shaped parameters.
MUD targets the 2D weight matrices of a transformer (other parameters such as embeddings and biases use AdamW). It keeps a momentum buffer with a Nesterov-style lookahead, then replaces Muon's orthogonalization with a cheaper Cholesky-like decorrelation: the lookahead matrix is row-normalized, its Gram matrix is formed, the lower triangle is solved against the rows, and the result is renormalized. Repeating this for a small number of passes \(p\) (typically one) approximately whitens the update at a fraction of Muon's cost. The whitened direction is then applied with a shape-dependent scale and decoupled weight decay.
where \(\theta\) are the matrix parameters, \(G_t\) the gradient, \(V_t\) the momentum buffer, \(M_t\) the Nesterov lookahead direction, \(\beta\) the momentum coefficient, \(\eta\) the learning rate, \(\lambda\) the decoupled weight decay, \(\epsilon\) a stability constant, \(\mathrm{tril}(\cdot)\) the lower-triangular part, \(p\) the number of decorrelation passes, \(Q_t\) the whitened update after \(p\) passes, and \(s(\theta) = 0.2\sqrt{\max(n,m)}\) a shape-dependent scale for an \(n \times m\) matrix.
Reference: Ben S. Southworth, Stephen Thomas, "Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training", arXiv 2026. https://arxiv.org/abs/2603.17970