Online Subspace Descent
Implements Online Subspace Descent, a memory-efficient training method that projects gradients into a low-rank subspace whose projection matrix is itself learned online.
Like GaLore, the weight matrix is updated through a low-rank projection \(P_t\), so the optimizer state (e.g. Adam moments) lives in the reduced \(k\)-dimensional subspace rather than the full parameter space. Unlike GaLore, the projection matrix is not recomputed by periodic SVD; instead \(P_t\) is updated every step by one optimizer step on an online PCA objective that tracks the current gradient. This continuous, dynamics-based update admits a Hamiltonian descent interpretation that guarantees convergence to stationary points for arbitrary smooth choices of the projection dynamics.
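One way to write a single step that is consistent with the notation used in this section is sketched below. It is a reading of the description above, not a transcription of the reference: the operator names \(\mathrm{Opt}^W\) and \(\mathrm{Opt}^P\), the loss symbol \(\ell_t\), the decoupled form of the weight decay, and the unnormalized Frobenius-norm PCA loss are assumptions, so check the exact form against the paper.

\[
\begin{aligned}
\hat{g}_t &= P_t^\top g_t, \\
(\hat{\Delta}_t,\ \hat{S}_t) &= \mathrm{Opt}^W\!\left(\hat{g}_t,\ \hat{S}_{t-1}\right), \\
\theta_{t+1} &= \theta_t - \eta^W_t\left(P_t \hat{\Delta}_t + \lambda^W \theta_t\right), \\
\ell_t(P) &= \left\lVert g_t - P P^\top g_t \right\rVert_F^2 + \lambda \left\lVert P^\top P - I_k \right\rVert_F^2, \\
(\Delta^P_t,\ S^P_t) &= \mathrm{Opt}^P\!\left(\nabla_P \ell_t(P_t),\ S^P_{t-1}\right), \\
P_{t+1} &= P_t - \eta^P_t\left(\Delta^P_t + \lambda^P P_t\right).
\end{aligned}
\]

With \(g_t \in \mathbb{R}^{n\times m}\) and \(P_t \in \mathbb{R}^{n\times k}\), the projected gradient \(\hat{g}_t\) and the state \(\hat{S}_t\) have shape \(k\times m\), which is where the memory saving comes from.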
Here \(\theta\) is a weight matrix in \(\mathbb{R}^{n\times m}\), \(g_t = \nabla_\theta L(\theta_t)\) is its gradient, \(P_t \in \mathbb{R}^{n\times k}\) (\(k \ll n\)) is the learned projection, \(\hat{g}_t = P_t^\top g_t\) is the projected gradient, \(\hat{\Delta}_t\) is the subspace optimizer update with state \(\hat{S}_t\), \(\Delta^P_t\) is the optimizer update for \(P\) with state \(S^P_t\), \(\eta^W_t\) and \(\eta^P_t\) are the learning rates for the weights and the projection, \(\lambda^W\) and \(\lambda^P\) are the weight-decay coefficients, and \(\lambda\) is the orthogonality-regularization coefficient. The authors recommend Adam for both optimizers, with \(\eta^P_t = \alpha\,\eta^W_t\) (e.g. \(\alpha=5\)) and \(\lambda^W=\lambda^P\).
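The recommended configuration (Adam for both optimizers, \(\eta^P_t = \alpha\,\eta^W_t\), shared weight decay) can be illustrated with a minimal NumPy sketch. Everything named here (`adam_direction`, `pca_loss`, `pca_loss_grad`, `init_state`, `osd_step`) is hypothetical and not part of any library API. The sketch assumes an unnormalized Frobenius-norm PCA loss with an orthogonality penalty, decoupled weight decay, and a random orthonormal initialization of \(P\); the reference implementation may differ on each of these points.

```python
import numpy as np


def adam_direction(grad, state, b1=0.9, b2=0.999, eps=1e-8):
    """Return the bias-corrected Adam direction and the updated (m, v, t) state."""
    m, v, t = state
    t += 1
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad**2
    m_hat = m / (1.0 - b1**t)
    v_hat = v / (1.0 - b2**t)
    return m_hat / (np.sqrt(v_hat) + eps), (m, v, t)


def pca_loss(P, g, lam):
    """Assumed online PCA objective: ||g - P P^T g||_F^2 + lam * ||P^T P - I||_F^2."""
    R = g - P @ (P.T @ g)
    E = P.T @ P - np.eye(P.shape[1])
    return np.sum(R**2) + lam * np.sum(E**2)


def pca_loss_grad(P, g, lam):
    """Analytic gradient of pca_loss with respect to P (shape n x k)."""
    R = g - P @ (P.T @ g)
    E = P.T @ P - np.eye(P.shape[1])
    return -2.0 * (R @ (g.T @ P) + g @ (R.T @ P)) + 4.0 * lam * (P @ E)


def init_state(n, m, k, seed=0):
    """Random orthonormal P (an assumption) plus zeroed Adam states."""
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((n, k)))  # n x k, orthonormal columns
    zeros = lambda shape: (np.zeros(shape), np.zeros(shape), 0)
    # Weight-optimizer state lives in the k x m subspace, not the full n x m space.
    return P, zeros((k, m)), zeros((n, k))


def osd_step(theta, P, g, state_w, state_p, lr_w, alpha=5.0, wd=0.0, lam=1.0):
    """One step: update theta through P_t, then update P on the PCA objective."""
    g_hat = P.T @ g                                   # projected gradient, k x m
    delta_hat, state_w = adam_direction(g_hat, state_w)
    new_theta = theta - lr_w * (P @ delta_hat + wd * theta)
    delta_p, state_p = adam_direction(pca_loss_grad(P, g, lam), state_p)
    new_P = P - alpha * lr_w * (delta_p + wd * P)     # eta^P = alpha * eta^W
    return new_theta, new_P, state_w, state_p


# Toy usage: fit a 6 x 4 matrix to a fixed target with a rank-2 projection.
rng = np.random.default_rng(1)
target = rng.standard_normal((6, 4))
theta = np.zeros((6, 4))
P, state_w, state_p = init_state(6, 4, 2)
for _ in range(100):
    g = theta - target                                # gradient of 0.5 * ||theta - target||^2
    theta, P, state_w, state_p = osd_step(theta, P, g, state_w, state_p, lr_w=1e-2)
print(0.5 * np.sum((theta - target) ** 2))
```

Both updates use the same \(P_t\): the weights move along the current projection before the projection itself is refreshed. The gradient of the PCA loss is written so that no \(n\times n\) intermediate is formed, which keeps the extra cost at a few \(n\times m\) by \(m\times k\) products per step.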
Reference: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu, "Memory-Efficient LLM Training with Online Subspace Descent", NeurIPS 2024. https://arxiv.org/abs/2408.12857