Natural GaLore
Implements Natural GaLore, a low-rank gradient projection method that applies an inverse-Fisher (natural gradient) correction within the projected subspace.
Like GaLore, the gradient is projected onto a low-rank subspace spanned by the top singular vectors \(P_t\) of the gradient, and the optimizer state lives in that compact space. Natural GaLore adds a second-order step: an empirical Fisher matrix is built from a sliding window of the \(s\) most recent projected gradients, and its inverse is applied to the current projected gradient via the Woodbury identity. The natural-gradient direction is therefore obtained by solving a small \(s \times s\) system rather than inverting the full Fisher matrix. Adam-style moment estimates are then updated with the corrected gradient, and the resulting step is projected back to full dimension.
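The steps above can be sketched as the following update. This is a reconstruction from the description, not necessarily the exact form used in the paper or the implementation: the loss \(\mathcal{L}\), the parameters \(\theta_t\), and the flattening \(\operatorname{vec}\) are notation assumed here, Adam bias correction and any GaLore scale factor are omitted, and the reshaping of the update before back-projection is left implicit.

\[
\begin{aligned}
g_t &= \operatorname{vec}\!\left(P_t^\top \nabla_\theta \mathcal{L}(\theta_t)\right) \\
\tilde{g}_t &= \left(\lambda I + G_t G_t^\top\right)^{-1} g_t
  = \frac{1}{\lambda}\left(g_t - G_t \left(\lambda I_s + G_t^\top G_t\right)^{-1} G_t^\top g_t\right) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, \tilde{g}_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, \tilde{g}_t^{\,2} \\
\theta_{t+1} &= \theta_t - \eta\, P_t\, \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
\]

The second line is the Woodbury identity applied to \(\lambda I + G_t G_t^\top\): the only matrix that has to be factorized is \(\lambda I_s + G_t^\top G_t\), which is \(s \times s\).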
Here \(P_t\) holds the top-\(r\) left singular vectors of the gradient and defines the projection onto the low-rank subspace, \(g_t\) is the projected gradient, and \(G_t\) collects the last \(s\) projected gradients, which form the regularized empirical Fisher \(\lambda I + G_t G_t^\top\) with Tikhonov regularization constant \(\lambda\). \(\tilde{g}_t\) is the natural (inverse-Fisher-preconditioned) gradient obtained via the Woodbury identity. \(m_t\) and \(v_t\) are the first and second moments with decay rates \(\beta_1\) and \(\beta_2\), \(\eta\) is the learning rate, and \(\epsilon\) is a small constant that ensures numerical stability.
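To make the Woodbury step concrete, here is a minimal NumPy sketch that computes \((\lambda I + G G^\top)^{-1} g\) with a single \(s \times s\) solve and compares it against a direct solve of the full system. The function name `natural_gradient_woodbury`, the dimensions, and the random data are hypothetical and chosen only for illustration. This is not the library's implementation, and it assumes the projected gradient has been flattened to a vector.

```python
import numpy as np


def natural_gradient_woodbury(g, G, lam):
    """Return (lam * I + G @ G.T)^{-1} @ g using only an s x s solve.

    g   : (d,)   current projected gradient, flattened (assumption)
    G   : (d, s) window of the last s projected gradients, one per column
    lam : float  Tikhonov regularization constant, must be > 0
    """
    s = G.shape[1]
    # Small s x s system: lam * I_s + G^T G
    small = lam * np.eye(s) + G.T @ G
    # Woodbury: (lam I + G G^T)^{-1} g = (g - G small^{-1} G^T g) / lam
    correction = G @ np.linalg.solve(small, G.T @ g)
    return (g - correction) / lam


# Illustrative sizes: d-dimensional projected gradient, window of s gradients.
rng = np.random.default_rng(0)
d, s, lam = 64, 4, 0.1
G = rng.standard_normal((d, s))
g = rng.standard_normal(d)

fast = natural_gradient_woodbury(g, G, lam)
# Reference: solve the full d x d system directly.
direct = np.linalg.solve(lam * np.eye(d) + G @ G.T, g)
```

Both routes give the same vector up to floating-point error. The Woodbury route factorizes an \(s \times s\) matrix instead of a \(d \times d\) one, which is where the saving comes from when the window is much smaller than the projected dimension.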
Reference: Arijit Das, "Natural GaLore: Accelerating GaLore for Memory-Efficient LLM Training and Fine-tuning", arXiv 2024. https://arxiv.org/abs/2410.16029