COSMOS
Implements COSMOS, a hybrid adaptive optimizer that splits the gradient matrix into a leading eigensubspace handled SOAP-style and a complementary subspace handled Muon-style.
COSMOS exploits the observation that the eigensubspaces of the gradient matrix carry uneven importance. For each matrix-shaped parameter it tracks a low-rank eigenbasis \(U_t\) (rank \(r \ll n\)) of the gradient second moment, updated by a power-iteration-with-QR step. Inside this rank-\(r\) subspace it applies a SOAP-like preconditioned update \(A_t\), using a projected second moment \(V_t\) with Adam-style bias correction. The residual component outside the subspace, \(M_t - M_t U_t U_t^\top\), is handled by a Muon-like update \(B_t\): Frobenius normalization followed by Newton-Schulz orthogonalization. The two pieces are combined linearly with weight \(\gamma\), and the combined step is rescaled to unit per-element scale before being applied. Because the second-moment statistics are kept only in rank-\(r\) form, the preconditioner state costs \(O(mr + nr)\) memory per layer instead of the full \(O(mn)\).
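The subspace tracking and the split of the momentum described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: it assumes the second moment is modeled as \(\beta_2 U_{t-1} S_{t-1} U_{t-1}^\top + (1-\beta_2) G_t^\top G_t\), and the function names `update_eigenbasis` and `split_momentum` are hypothetical.

```python
import numpy as np


def update_eigenbasis(U_prev, S_prev, G, beta2):
    """One power-iteration-with-QR step on a rank-r model of the second moment.

    Assumption: the second moment is modeled as
    H = beta2 * U_prev S_prev U_prev^T + (1 - beta2) * G^T G,
    an n x n matrix that is never formed explicitly.
    """
    GU_prev = G @ U_prev                                  # (m, r)
    # H @ U_prev, simplified with U_prev^T U_prev = I.
    Y = beta2 * (U_prev @ S_prev) + (1.0 - beta2) * (G.T @ GU_prev)
    U, _ = np.linalg.qr(Y)                                # orthonormal columns, (n, r)
    # S = U^T H U, computed from rank-r factors only.
    P = U.T @ U_prev                                      # (r, r)
    GU = G @ U                                            # (m, r)
    S = beta2 * (P @ S_prev @ P.T) + (1.0 - beta2) * (GU.T @ GU)
    return U, S


def split_momentum(M, U):
    """Split M into coordinates inside span(U) and the residual outside it."""
    M_proj = M @ U                 # (m, r), the part a SOAP-like update would see
    residual = M - M_proj @ U.T    # M - M U U^T, the part a Muon-like update would see
    return M_proj, residual


rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
G = rng.standard_normal((m, n))
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))  # random orthonormal start
S0 = np.zeros((r, r))
U1, S1 = update_eigenbasis(U0, S0, G, beta2=0.99)
M_proj, residual = split_momentum(G, U1)  # G stands in for the momentum M_t here
```

Only the \(n \times r\), \(r \times r\), and \(m \times r\) factors are stored or multiplied, which is where the low-rank memory footprint comes from. By construction the residual is orthogonal to the tracked subspace, so the two updates act on complementary parts of the momentum.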
Notation: \(W_t\) are the parameters, \(G_t\) the gradient of shape \(m \times n\), \(M_t\) the first-moment EMA, \(U_t \in \mathbb{R}^{n\times r}\) the leading eigenbasis with \(S_t \in \mathbb{R}^{r\times r}\) its second-moment projection, \(V_t\) the second moment within the subspace, \(\eta\) the learning rate, \(\gamma\) the combination weight, \(\beta_1,\beta_2\) the moment decay rates, \(\epsilon\) a stability constant, \(\mathrm{NS5}(\cdot)\) five Newton-Schulz orthogonalization iterations, and \(\mathrm{NORM}(X) = \sqrt{n}\, X / \lVert X\rVert_F\) the Frobenius rescaling.
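The two helper operators defined above, \(\mathrm{NORM}\) and \(\mathrm{NS5}\), can be illustrated with a short NumPy sketch. `norm_rescale` follows the formula given here directly. `ns5` is an assumption: it uses the quintic Newton-Schulz iteration and coefficients from the public Muon code, which may differ from what this implementation uses, and both function names are hypothetical.

```python
import numpy as np


def norm_rescale(X):
    """NORM(X) = sqrt(n) * X / ||X||_F, with n the number of columns of X."""
    n = X.shape[1]
    return np.sqrt(n) * X / np.linalg.norm(X)  # default matrix norm is Frobenius


def ns5(X, steps=5, eps=1e-7):
    """Approximately orthogonalize X with a quintic Newton-Schulz iteration.

    Assumption: coefficients (a, b, c) are borrowed from the public Muon code.
    The result has singular values near 1, not exactly 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = X / (np.linalg.norm(X) + eps)      # Frobenius normalization first
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


rng = np.random.default_rng(0)
R = rng.standard_normal((8, 6))   # stands in for a residual M - M U U^T
O = ns5(R)                        # roughly orthogonal factor of R
B = norm_rescale(O)               # Frobenius norm is now sqrt(6)
```

The iteration only pushes singular values into a band around 1 rather than converging to an exact orthogonal factor, which is why a fixed Frobenius rescaling afterwards gives the update a predictable overall magnitude.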
Reference: Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao, "COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs", arXiv 2025. https://arxiv.org/abs/2502.17410