Optimal Low-Rank SGE
Implements Optimal Low-Rank SGE, a memory-efficient optimizer that descends in randomly projected low-rank subspaces with the projector chosen to minimize gradient-estimator variance.
At each outer step the parameters \(\theta\) are confined to an \(r\)-dimensional subspace spanned by a sampled projection matrix \(V_t\). A small low-rank coordinate \(B\) is updated by \(K\) inner gradient steps inside that subspace and then lifted back to the full space, so only the \(r\)-column factor is kept in memory rather than the full gradient. The projector is sampled so that \(\mathbb{E}[V_t V_t^\top] = c\,I_n\) (weak unbiasedness); the contribution is an instance-dependent sampling distribution that allocates the rank budget across spectral directions to attain the minimum mean-squared error of the estimated gradient.
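The outer/inner structure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the lifted point is taken to be \(\theta_t + V_t B\), \(B\) starts at zero, the inner update is plain SGD on \(B\), and the projector is a placeholder Gaussian sampler that satisfies \(\mathbb{E}[V_t V_t^\top] = c\,I_n\) but is not the optimal instance-dependent distribution. The names `outer_step`, `gaussian_projector`, and `grad_fn` are hypothetical.

```python
import numpy as np

def outer_step(theta, grad_fn, V, lr, K):
    """One outer step: K SGD steps on B inside span(V), then lift back."""
    B = np.zeros(V.shape[1])               # low-rank coordinate of width r
    for _ in range(K):
        # For clarity the full gradient is formed here; a memory-efficient
        # version would differentiate with respect to B directly.
        g_full = grad_fn(theta + V @ B)
        B -= lr * (V.T @ g_full)           # chain rule: dF/dB = V^T grad F
    return theta + V @ B                   # lift to the full space

def gaussian_projector(rng, n, r, c=1.0):
    """Placeholder isotropic sampler: i.i.d. N(0, c/r) entries give E[V V^T] = c I_n."""
    return rng.normal(scale=np.sqrt(c / r), size=(n, r))

# Toy problem (assumed): F(x) = 0.5 * ||x - target||^2, noise-free gradient.
rng = np.random.default_rng(0)
n, r, K = 20, 4, 3
target = rng.normal(size=n)
grad_fn = lambda x: x - target

theta = np.zeros(n)
loss0 = 0.5 * np.sum((theta - target) ** 2)
for t in range(200):
    V = gaussian_projector(rng, n, r)      # fresh subspace at every outer step
    theta = outer_step(theta, grad_fn, V, lr=0.1, K=K)
loss1 = 0.5 * np.sum((theta - target) ** 2)
```

Each outer step only moves \(\theta\) within the \(r\)-dimensional span of \(V_t\); resampling \(V_t\) is what lets the iterates eventually cover the full space.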
Here \(\theta_t\) are the parameters, \(\eta_t\) the step size, \(F\) the per-sample loss on data \(\xi_{t,k}\), \(B\) the low-rank inner coordinate of width \(r\), and \(V_t\) the sampled projection matrix. For the sampling distribution, \(\Sigma = Q\,\mathrm{diag}(\sigma_1,\dots,\sigma_n)\,Q^\top\) is the spectral decomposition of the gradient covariance, \(\pi_i^\star\) are the optimal inclusion probabilities, \(\tau = \#\{i:\pi_i^\star=1\}\) is the number of directions that are always included, \(J\subset\{1,\dots,n\}\) is a sampled index set of size \(r\) with \(\Pr(i\in J)=\pi_i^\star\), and \(c\) is the isotropy constant.
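One way to turn inclusion probabilities into a projector is sketched below. It is an assumption-laden illustration: the index set \(J\) is drawn by systematic sampling (a standard fixed-size scheme with prescribed inclusion probabilities, swapped in here and not necessarily the sampler used in the paper), the columns are rescaled as \(V = Q_J\,\mathrm{diag}(\sqrt{c/\pi_i})_{i\in J}\), and the vector `pi` is a made-up example rather than the optimal \(\pi^\star\). The names `sample_index_set` and `spectral_projector` are hypothetical.

```python
import numpy as np

def sample_index_set(rng, pi):
    """Systematic sampling: returns r = sum(pi) distinct indices with Pr(i in J) = pi[i]."""
    r = int(round(pi.sum()))
    edges = np.cumsum(pi)
    edges[-1] = r                               # guard against floating-point drift
    points = rng.uniform() + np.arange(r)       # u, u+1, ..., u+r-1 with u in [0, 1)
    # Index i is selected when a point lands in [edges[i-1], edges[i]).
    return np.searchsorted(edges, points, side="right")

def spectral_projector(rng, Q, pi, c=1.0):
    """V = Q[:, J] with column i scaled by sqrt(c / pi[i]), so E[V V^T] = c * I_n."""
    J = sample_index_set(rng, pi)
    return Q[:, J] * np.sqrt(c / pi[J]), J

rng = np.random.default_rng(0)
n = 6
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # stand-in orthonormal eigenbasis
pi = np.array([1.0, 1.0, 0.4, 0.3, 0.2, 0.1])   # illustrative only: r = 3, tau = 2

# Monte Carlo check of weak unbiasedness with c = 1.
trials = 20000
acc = np.zeros((n, n))
for _ in range(trials):
    V, J = spectral_projector(rng, Q, pi)
    acc += V @ V.T
max_dev = np.abs(acc / trials - np.eye(n)).max()
```

The check works because \(\mathbb{E}[V V^\top] = \sum_i \pi_i \,(c/\pi_i)\, q_i q_i^\top = c\,QQ^\top = c\,I_n\) for any sampler with the stated inclusion probabilities. Directions with \(\pi_i = 1\) appear in every draw with unit scaling when \(c = 1\).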
Reference: Zehao Li, Tao Ren, Zishi Zhang, Xi Chen, Yijie Peng, "Optimal Low-Rank Stochastic Gradient Estimation for LLM Training", arXiv 2026. https://arxiv.org/abs/2603.20632