LoRA-Muon
Implements LoRA-Muon, Muon's spectral steepest descent specialized to the low-rank manifold of LoRA factors.
LoRA finetuning writes the adapted weight as \(W = AB^\top\) with factors \(A \in \mathbb{R}^{m\times r}\), \(B \in \mathbb{R}^{n\times r}\). With factor-wise optimizers such as AdamW, learning rates transfer poorly across rank and scale. LoRA-Muon instead solves the spectral-norm steepest-descent problem on the fixed-rank manifold \(\mathcal{M}_r = \{W \in \mathbb{R}^{m\times n} : \mathrm{rank}(W)=r\}\), so the update is the Muon update of the product \(W\) projected onto the tangent space of \(\mathcal{M}_r\). The trust-region budget is split evenly between the two tangent components \(\Delta A\, B^\top\) and \(A\, \Delta B^\top\). Each side is whitened by the current Gram geometry, \(S_A = A^\top A\) and \(S_B = B^\top B\), before the matrix-sign step. A split weight-decay rule applies decay to the composed weight \(W\) rather than to each factor separately, which keeps step sizes matched to full-rank Muon.
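The whitening and the even budget split can be illustrated numerically. The NumPy sketch below is one plausible reading of the description above, not the reference update rule: it omits momentum and weight decay, uses an SVD-based `msign` for clarity, and the names `G_W` (a stand-in gradient with respect to \(W\)), `eta`, `inv_sqrt_psd`, and `msign_svd` are hypothetical. It assumes each factor step has the form \(\Delta A = -\tfrac{\eta}{2}\,\mathrm{msign}(G_W B R_B)\,R_B\) (and symmetrically for \(B\)), and checks that each tangent component then has spectral norm exactly \(\eta/2\).

```python
import numpy as np

def inv_sqrt_psd(S, eps=1e-12):
    """Inverse square root of a symmetric positive-definite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(np.maximum(w, eps))) @ V.T  # V diag(w^-1/2) V^T

def msign_svd(X):
    """Reference msign(X) = U V^T from a reduced SVD (illustration only)."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def spectral_norm(X):
    return np.linalg.norm(X, 2)  # largest singular value

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
G_W = rng.standard_normal((m, n))  # hypothetical gradient w.r.t. W = A B^T

# Gram matrices and their inverse square roots.
S_A, S_B = A.T @ A, B.T @ B
R_A, R_B = inv_sqrt_psd(S_A), inv_sqrt_psd(S_B)

# Whitened factors have orthonormal columns: Q^T Q = R S R = I.
Q_A, Q_B = A @ R_A, B @ R_B

eta = 0.1  # total trust-region budget (assumed value)

# Assumed step form: whiten the factor gradient, apply msign, map back with R.
# The gradient w.r.t. A is G_W B, so its whitened form is G_W B R_B = G_W Q_B.
dA = -(eta / 2) * msign_svd(G_W @ Q_B) @ R_B
# The gradient w.r.t. B is G_W^T A, so its whitened form is G_W^T Q_A.
dB = -(eta / 2) * msign_svd(G_W.T @ Q_A) @ R_A

comp_A = dA @ B.T  # tangent component  Delta A  B^T
comp_B = A @ dB.T  # tangent component  A  Delta B^T

print("||dA B^T||_2 =", spectral_norm(comp_A))
print("||A dB^T||_2 =", spectral_norm(comp_B))
print("||first-order change in W||_2 =", spectral_norm(comp_A + comp_B))
```

Under this assumed step form, \(\Delta A\, B^\top = -\tfrac{\eta}{2}\,\mathrm{msign}(G_W Q_B)\,Q_B^\top\), and right-multiplying by \(Q_B^\top\) preserves singular values because \(Q_B\) has orthonormal columns. By the triangle inequality, the first-order change in \(W\) then has spectral norm at most \(\eta\), independent of the scale of \(A\) and \(B\).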
The update rule uses the following notation: \(A,B\) are the LoRA factors, \(W_{\mathrm{pre}}\) is the frozen base weight, \(\eta_t\) is the learning rate, \(\beta\) is the momentum coefficient, \(\lambda\) is the weight decay, and \(m_t^A,m_t^B\) are the first moments of the factors. The factor Gram matrices \(S_A,S_B\) have inverse square roots \(R_A = S_A^{-1/2}\), \(R_B = S_B^{-1/2}\). The matrix sign is \(\mathrm{msign}(X)=UV^\top\) for the reduced SVD \(X=U\Sigma V^\top\); it is the spectral-norm linear minimization oracle and is computed by Newton-Schulz iteration, without an explicit SVD.
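As an illustration of computing \(\mathrm{msign}\) without an SVD, the following NumPy sketch uses the textbook cubic Newton-Schulz iteration \(Y \leftarrow \tfrac{3}{2}Y - \tfrac{1}{2}YY^\top Y\). This is an assumption made for clarity: the source does not state which Newton-Schulz variant is used, and practical Muon implementations often use a tuned higher-order polynomial in low precision. The function name `msign_newton_schulz` and the `steps` default are hypothetical.

```python
import numpy as np

def msign_newton_schulz(X, steps=40):
    """Approximate msign(X) = U V^T with the cubic Newton-Schulz iteration.

    Assumes X has full rank; singular values near zero converge slowly.
    """
    X = np.asarray(X, dtype=np.float64)
    transposed = X.shape[0] < X.shape[1]
    if transposed:
        X = X.T  # work with a tall matrix so Y^T Y is the smaller Gram matrix
    # Frobenius normalization puts every singular value in (0, 1],
    # inside the convergence region (0, sqrt(3)) of the iteration.
    Y = X / (np.linalg.norm(X) + 1e-30)
    I = np.eye(Y.shape[1])
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which tends to 1.
        Y = Y @ (1.5 * I - 0.5 * (Y.T @ Y))
    return Y.T if transposed else Y

# Compare against the SVD definition on a random matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, _, Vt = np.linalg.svd(X, full_matrices=False)
print("max abs error vs. U V^T:", np.abs(msign_newton_schulz(X) - U @ Vt).max())
```

The iteration only uses matrix products, which is why it suits accelerators better than an explicit SVD. The number of steps needed grows with the ratio of the Frobenius norm to the smallest singular value, so 40 steps is a conservative choice for this small, well-conditioned example rather than a recommended setting.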
Reference: Franz Louis Cesista, Cédric Simal, Katherine Crowson, Stella Biderman, "LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold", arXiv 2026. https://arxiv.org/abs/2606.12921