SKA-SGD¶
Implements SKA-SGD, stochastic gradient descent accelerated by projection onto a low-dimensional Krylov subspace.
To handle poorly-conditioned problems, SKA-SGD keeps a sliding window of the last \(s\) stochastic gradients and projects the current gradient onto the subspace they span. The projection coefficients solve a small regularized Gram system, which is computed by streaming Gauss-Seidel (equivalent to modified Gram-Schmidt) at \(O(s^2)\) cost rather than forming the full system explicitly. The resulting direction replaces the raw gradient in an ordinary (optionally Nesterov-momentum) SGD step.
where \(g_t\) is the stochastic gradient, \(P_t\) stacks the most recent \(s\) gradients as a Krylov basis, \(\alpha_t\) are the projection coefficients obtained by streaming Gauss-Seidel on the regularized Gram system, \(\lambda\) is the regularization parameter, \(d_t\) is the projected search direction, \(\eta\) is the learning rate, \(\beta\) is the momentum coefficient, and \(v_t\) is the velocity.
Reference: Stephen Thomas, "Streaming Krylov-Accelerated Stochastic Gradient Descent", arXiv 2025. https://arxiv.org/abs/2505.07046