VR-SGD¶
Implements VR-SGD, a stochastic variance-reduced method that decouples the snapshot point from the epoch starting point.
VR-SGD runs in epochs. Once per epoch it computes a full gradient \(\mu_s\) at a snapshot \(\tilde\theta_{s-1}\), then performs \(m\) inner steps using the SVRG-style variance-reduced estimator. Its defining feature is that the snapshot point and the starting point of each epoch are chosen differently: the snapshot is the average of the previous epoch's iterates, while the next epoch starts from the last iterate. This decoupling, together with a learning rate that can be increased over early epochs, lets VR-SGD use a much larger step size than SVRG.
where \(\theta_k\) are the inner iterates of epoch \(s\), \(\tilde\theta_{s-1}\) is the snapshot (the average over the previous epoch), \(\mu_s\) is the full gradient at that snapshot, \(i_k\) is a uniformly sampled index, \(\eta_s\) is the per-epoch learning rate (optionally grown via \(\eta_s = \eta_0/\max\{\alpha,\,2/(s+1)\}\) with \(\alpha\in(0,1]\)), \(n\) is the number of components, \(m\) the inner iterations per epoch, and \(g\) a possibly non-smooth regularizer (the smooth case drops \(\nabla g\)).
Reference: Fanhua Shang, Kaiwen Zhou, Hongying Liu, James Cheng, Ivor W. Tsang, Lijun Zhang, Dacheng Tao, Licheng Jiao, "VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning", arXiv 2018. https://arxiv.org/abs/1802.09932