MuonAll¶
Implements MuonAll, a Muon variant that routes every parameter, including 1D ones, through the orthogonalized update.
Standard Muon applies its Newton-Schulz orthogonalization only to 2D weight matrices and leaves 1D parameters (biases, norms, embeddings) to a separate optimizer. MuonAll instead embeds each 1D parameter as the main diagonal of a square matrix, runs it through the same momentum-plus-orthogonalization pipeline, and maps the result back to a vector, so a single rule governs all parameters during finetuning.
where \(\theta\) are the parameters (a 1D parameter is first reshaped to \(\mathrm{diag}(\theta)\) and the orthogonalized update reshaped back), \(g_t\) is the gradient, \(B_t\) is the momentum buffer with decay \(\mu\), \(\eta\) is the learning rate, and \(\mathrm{NewtonSchulz5}\) is the quintic Newton-Schulz iteration that approximates the orthogonal factor \((M_t M_t^\top)^{-1/2} M_t = U V^\top\).
Reference: Saurabh Page, Advait Joshi, S. S. Sonawane, "MuonAll: Muon Variant for Efficient Finetuning of Large Language Models", arXiv 2025. https://arxiv.org/abs/2511.06086