RACS / Alice
Implements RACS and Alice, structured Fisher-approximation optimizers for memory-efficient LLM training.
Both stem from a structured approximation of the Fisher information. RACS (Row And Column Scaled SGD) preconditions the gradient \(G_t\) by two diagonal scalings, one acting on rows and one on columns, each maintained by an exponential moving average and estimated through a fixed-point iteration. Alice extends this with a low-rank subspace: it projects the gradient onto a leading eigenbasis \(U_t\) (refreshed every \(K\) steps by subspace switching), runs Adam-style moments inside that subspace, and adds a compensation term that recovers signal from the discarded complement directions.
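The row and column scaling idea can be illustrated with a short NumPy sketch. It assumes the fixed-point iteration alternates between the two scale vectors so that their outer product approximates the elementwise squared gradient; that form, the function names `row_col_scales` and `precondition`, and the defaults `num_iters` and `eps` are assumptions made for illustration and are not taken from this library or confirmed against the paper.

```python
import numpy as np

def row_col_scales(G, num_iters=5, eps=1e-8):
    """Alternating fixed-point estimate of column scales s (n,) and row scales q (m,).

    Assumed objective: outer(q, s) approximates G**2 elementwise.
    """
    G2 = G * G                       # elementwise squared gradient, shape (m, n)
    q = np.ones(G.shape[0])          # start from uniform row scales
    for _ in range(num_iters):
        s = G2.T @ q / (q @ q + eps)  # column scales given the current row scales
        q = G2 @ s / (s @ s + eps)    # row scales given the updated column scales
    return s, q

def precondition(G, s, q, eps=1e-8):
    """Scale entry (i, j) of G by 1 / sqrt(q[i] * s[j])."""
    return G / np.sqrt(np.outer(q, s) + eps)

# Usage: a gradient whose square is exactly rank one.
G = np.outer([1.0, 2.0, 3.0], [1.0, -2.0])
s, q = row_col_scales(G)
G_scaled = precondition(G, s, q)
```

Only two vectors of length \(n\) and \(m\) are stored per \(m \times n\) weight, which is where the memory saving over a full second-moment buffer comes from in this sketch.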
For RACS, the diagonal scalings \(s_t, q_t\) track \(\mathrm{Diag}(S_t), \mathrm{Diag}(Q_t)\), where \(S_t, Q_t\) are the column and row preconditioner estimates.
For Alice, the projected gradient is \(\sigma_t = U_t^{\top} G_t\), where \(U_t\) is the rank-\(r\) subspace basis (of size \(m \times r\) for an \(m \times n\) weight), \(\sigma_t^{\odot 2}\) is the elementwise square, \(p_t\) is the complement scaling, and \(C_t\) is the compensation for discarded directions, weighted by \(\alpha_c\).
In the update rules, \(W\) are the weights, \(G_t\) the gradient, \(\lambda\) the learning rate, \(\alpha\) a scaling factor, \(\beta, \beta_1, \beta_2\) EMA decay rates, \(\gamma\) a norm-growth limit with running norm \(\phi_t\), and \(\epsilon\) a small constant inside the square roots for stability.
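To show how these symbols might fit together for RACS, here is a minimal NumPy sketch of one step: an EMA of the scalings with decay \(\beta\), two-sided diagonal preconditioning with \(\epsilon\) inside the square roots, and a norm-growth limiter governed by \(\gamma\) and the running norm \(\phi_t\). The limiter formula, the state layout, the default values, and the name `racs_like_step` are assumptions for illustration; the exact update is the one given in the reference.

```python
import numpy as np

def racs_like_step(W, G, s_new, q_new, state,
                   lr=1e-2, alpha=0.05, beta=0.9, gamma=1.01, eps=1e-8):
    """One RACS-style update (illustrative, not the library's implementation).

    s_new: fresh column-scale estimate, shape (n,)
    q_new: fresh row-scale estimate, shape (m,)
    state: dict with keys "s", "q" (running scales) and "phi" (running norm)
    """
    # Exponential moving averages of the column and row scalings.
    s = beta * state["s"] + (1.0 - beta) * s_new
    q = beta * state["q"] + (1.0 - beta) * q_new

    # Two-sided diagonal preconditioning: rows scaled by q, columns by s.
    G_tilde = G / (np.sqrt(q + eps)[:, None] * np.sqrt(s + eps)[None, :])

    # Norm-growth limiter (assumed form): the update norm may grow by at most
    # a factor gamma relative to the previous running norm phi.
    norm = np.linalg.norm(G_tilde)
    phi_prev = state["phi"]
    eta = gamma / max(norm / phi_prev, gamma) if phi_prev > 0 else 1.0
    phi = eta * norm

    W_new = W - lr * alpha * eta * G_tilde
    return W_new, {"s": s, "q": q, "phi": phi}

# Usage: two steps on a 3x2 weight, the second with a 10x larger gradient.
state = {"s": np.ones(2), "q": np.ones(3), "phi": 0.0}
W = np.zeros((3, 2))
W, state = racs_like_step(W, np.ones((3, 2)), np.ones(2), np.ones(3), state)
W, state = racs_like_step(W, 10.0 * np.ones((3, 2)), np.ones(2), np.ones(3), state)
```

In the second step the preconditioned norm grows tenfold, so the limiter in this sketch rescales the update until the running norm has grown by only the factor \(\gamma\).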
Reference: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds, "Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension", arXiv 2025. https://arxiv.org/abs/2502.07752