HTMuon¶
Implements HTMuon, Muon with a heavy-tailed spectral correction that raises the momentum matrix's singular values to a power \(p \in (0,1)\).
Muon orthogonalizes the momentum buffer, which is equivalent to setting all of its singular values to one and assigns equal weight to every singular-vector direction. This produces light-tailed update and weight spectra, which Heavy-Tailed Self-Regularization theory associates with weaker generalization. HTMuon keeps Muon's momentum and matrix structure but replaces the orthogonalization with a power transform of the singular values: from the SVD \(m_t = U_t \Sigma_t V_t^\top\) it forms \(U_t \Sigma_t^{p} V_t^\top\). Smaller \(p\) moves toward Muon's all-ones spectrum, while \(p \to 1\) recovers SGDM; the intermediate regime makes updates more heavy-tailed while retaining the matrix-based coupling between directions.
where \(\theta\) are the matrix-shaped parameters of shape \(m \times n\), \(g_t\) is the gradient, \(m_t\) the momentum buffer (\(m_0 = 0\)), \(\beta\) the momentum coefficient, \(U_t \Sigma_t V_t^\top\) the SVD of \(m_t\), \(p \in (0,1)\) the spectral power (default \(p = 0.125\); \(p = 0\) recovers Muon, \(p = 1\) gives SGDM), \(s\) the shape-dependent scaling factor, \(\eta\) the learning rate, and \(\lambda\) the decoupled weight decay.
Reference: Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang, "HTMuon: Improving Muon via Heavy-Tailed Spectral Correction", arXiv 2026. https://arxiv.org/abs/2603.10067