
FiraAdamW

Implements FiraAdamW, a variant of AdamW that applies full-rank updates while keeping optimizer state within a GaLore-style low-rank memory budget.

\[
\begin{aligned}
&P_t = U[:, {:}r] \quad \text{where} \quad U \Sigma V^\top = \mathrm{SVD}(g_t) \quad \text{(recomputed every $T$ steps)} \\
&r_t = P_t^\top g_t \\
&m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, r_t \\
&v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, r_t^2 \\
&\eta_t = \eta\, \sqrt{1 - \beta_2^t} / (1 - \beta_1^t) \\
&\psi_t = m_t / (\sqrt{v_t} + \epsilon) \\
&(\phi_t)_i = \lVert (\psi_t)_{:,i} \rVert_2 \, / \, \lVert (r_t)_{:,i} \rVert_2 \\
&S_t = \phi_t \odot (g_t - \alpha P_t r_t) \\
&S_t \leftarrow \gamma\, S_t\, \lVert S_{t-1} \rVert / \lVert S_t \rVert \quad \text{if } \lVert S_t \rVert > \gamma \lVert S_{t-1} \rVert \\
&\theta_{t+1} = (1 - \eta\lambda)\, (\theta_t - \eta_t\, (\alpha P_t \psi_t + S_t))
\end{aligned}
\]

where \(r\) is the projection rank (not to be confused with the projected gradient \(r_t\)), \(T\) the number of steps between subspace updates (update_proj_gap), \(\alpha\) the projection scale factor (alpha), \(\gamma = 1.01\) the norm-growth limit, and \(\lambda\) the decoupled weight decay, which is applied after the gradient step, as the upstream implementation does. Bias correction is folded into the step size \(\eta_t\).

The Adam statistics \(m_t, v_t\) live in the rank-\(r\) subspace, as in GaLore. The residual gradient \(g_t - \alpha P_t r_t\) outside the subspace is scaled by the norm-based factor \(\phi_t\), with one factor per column of \(r_t\) (per row when the gradient is projected from the right), so the full-rank direction is trained without full-rank optimizer state. The norm-growth limiter caps the step-to-step growth of \(\lVert S_t \rVert\) at a factor of \(\gamma\) to suppress loss spikes.
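The update rule can be traced line by line in a small NumPy sketch. This is an illustration under stated assumptions, not the library's implementation: the function name `fira_adamw_step` and the `state` dictionary are hypothetical, only the left-projected case is covered, matrix norms are taken as Frobenius norms, the limiter is skipped on the first step (when no previous norm exists), and a small constant is added to the denominator of \(\phi_t\) to avoid division by zero.

```python
import numpy as np

def fira_adamw_step(theta, g, state, *, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                    weight_decay=0.0, rank=2, update_proj_gap=200, alpha=1.0,
                    gamma=1.01):
    """One left-projected step on a 2D parameter (hypothetical helper)."""
    b1, b2 = betas
    t = state["step"] = state.get("step", 0) + 1

    # P_t: top-`rank` left singular vectors, recomputed every T steps.
    if "P" not in state or (t - 1) % update_proj_gap == 0:
        U, _, _ = np.linalg.svd(g, full_matrices=False)
        state["P"] = U[:, :rank]
    P = state["P"]

    # Adam statistics are kept only in the rank-r subspace.
    r = P.T @ g
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * r
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * r**2
    step_size = lr * np.sqrt(1 - b2**t) / (1 - b1**t)  # eta_t
    psi = m / (np.sqrt(v) + eps)

    # phi_t: one scaling factor per column of r_t.
    # The 1e-8 guard is an assumption, not part of the formulas above.
    phi = np.linalg.norm(psi, axis=0) / (np.linalg.norm(r, axis=0) + 1e-8)
    S = phi * (g - alpha * (P @ r))  # broadcasts phi across rows

    # Norm-growth limiter: rescale S_t so its norm is at most gamma * ||S_{t-1}||.
    s_norm = np.linalg.norm(S)
    prev = state.get("s_norm")
    if prev is not None and s_norm > gamma * prev:
        S = S * (gamma * prev / s_norm)
        s_norm = gamma * prev
    state["s_norm"] = s_norm

    # Gradient step first, then decoupled weight decay.
    update = alpha * (P @ psi) + S
    return (1 - lr * weight_decay) * (theta - step_size * update)

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 3))
state = {}
theta = fira_adamw_step(theta, rng.standard_normal((4, 3)), state, rank=2)
```

Note that the only per-parameter state is the projector, the two rank-\(r\) moment buffers, and one scalar norm for the limiter, which is where the memory saving comes from.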

Note: Projection is enabled per parameter group. Groups carrying the rank, update_proj_gap, alpha, and proj_type keys are projected (2D parameters only); all other groups get plain AdamW.
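As a rough sketch of how such groups might be assembled, the helper below splits parameters by dimensionality and attaches the four keys to the 2D group. It assumes PyTorch-style parameter-group dictionaries with a `params` entry. The helper name `build_param_groups` is hypothetical, NumPy arrays stand in for tensors, and all values (including `proj_type="std"`, borrowed from GaLore's convention) are placeholders rather than documented defaults.

```python
import numpy as np

def build_param_groups(named_params, *, rank=128, update_proj_gap=200,
                       alpha=1.0, proj_type="std"):
    """Split parameters into a plain-AdamW group and a projected group."""
    projected, plain = [], []
    for _name, p in named_params:
        # Only 2D parameters are eligible for projection.
        (projected if p.ndim == 2 else plain).append(p)
    return [
        {"params": plain},  # no projection keys: plain AdamW
        {"params": projected, "rank": rank, "update_proj_gap": update_proj_gap,
         "alpha": alpha, "proj_type": proj_type},
    ]

# Stand-ins for a model's named parameters.
named = [("linear.weight", np.zeros((8, 4))), ("linear.bias", np.zeros(8))]
groups = build_param_groups(named)
```

The resulting list would then be passed to the optimizer in place of a flat parameter list.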

Reference: Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang, "Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?", NeurIPS 2025. https://arxiv.org/abs/2410.01623

