GUM
Implements GUM, an unbiased low-rank gradient projection built on Muon.
Low-rank projection methods such as GaLore compress the optimizer state by projecting gradients onto the top-\(r\) left singular subspace \(P_t\) of the gradient, but discarding the orthogonal complement makes the projected update a biased estimate of the full gradient. GUM removes this bias by stochastically choosing, per layer and per period, between a low-rank update on the captured subspace and a full-rank update on its complement, then importance-reweighting each branch so the expected update equals the full-gradient Muon step. Each branch orthogonalizes its accumulated momentum with the Newton-Schulz iteration, inheriting Muon's matrix-sign update while keeping low-rank memory cost on the dominant branch.
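As an illustration of the orthogonalization step, the following is a minimal sketch of a Newton-Schulz iteration in NumPy. It uses the classical cubic polynomial, which is an assumption: the polynomial, step count, and normalization used by this implementation may differ (Muon-style optimizers often use a tuned higher-order polynomial with only a few steps). The function name `newton_schulz` and the parameters `steps` and `eps` are hypothetical.

```python
import numpy as np

def newton_schulz(G, steps=20, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of G = U S V^T.

    Sketch only: classical cubic iteration, not necessarily the variant
    used by the GUM implementation.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, pushing it toward 1
        # while leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Build a test matrix with known singular vectors and singular values 3, 2, 1.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([3.0, 2.0, 1.0]) @ V.T

O = newton_schulz(G)  # should be close to U @ V.T, with orthonormal columns
```

The result keeps the singular vectors of the input and replaces its singular values with values close to 1, which is the matrix-sign behaviour the paragraph above refers to.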
At the start of each period \(t\), a projector \(P_t = U_t[:,{:}r]\) (the first \(r\) left singular vectors) is taken from the SVD of the period's first gradient, and each layer is sampled for a full-rank update with probability \(q\). With \(\mathrm{NS}(\cdot)\) denoting the Newton-Schulz orthogonalization:
where \(W\) is the weight matrix, \(G_{t,k}\) the gradient at iteration \(k\) of period \(t\), \(P_t\) the rank-\(r\) projector, \(\beta\) the momentum coefficient, \(\eta\) the learning rate, \(q=\gamma/N_L\) the per-layer full-rank sampling probability (\(\gamma\) full-rank layers out of \(N_L\) total), and the factors \(\tfrac{1}{1-q}\) and \(\tfrac{1}{q}\) importance-reweight the two branches so the expected update is unbiased.
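The role of the factors \(\tfrac{1}{1-q}\) and \(\tfrac{1}{q}\) can be checked numerically on a single gradient. The sketch below covers only the linear reweighting of the gradient: it omits momentum and the Newton-Schulz step, so it is not the full GUM update. The sizes `m`, `n`, `r` and the value of `q` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, q = 6, 4, 2, 0.25  # hypothetical layer shape, rank, and sampling probability

G = rng.standard_normal((m, n))

# Rank-r projector from the top-r left singular vectors, P = U[:, :r].
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]

captured = P @ (P.T @ G)   # component of G inside the projected subspace
residual = G - captured    # component in the orthogonal complement

# Importance-reweighted branches.
low_rank_branch = captured / (1.0 - q)   # chosen with probability 1 - q
full_rank_branch = residual / q          # chosen with probability q

# Expectation over the branch choice recovers the full gradient.
expected = (1.0 - q) * low_rank_branch + q * full_rank_branch
```

Weighting each branch by the inverse of its selection probability is what makes the two pieces sum back to `G` in expectation; dropping either factor would leave the estimate biased toward one subspace.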
Reference: Rui Pan, Yang Luo, Yuxing Liu, Yang You, Tong Zhang, "Unbiased Gradient Low-Rank Projection", arXiv 2025. https://arxiv.org/abs/2510.17802