Step-Tuned SGD
Implements Step-Tuned SGD, an SGD variant that tunes its step size online from a second-order curvature estimate built out of gradient differences.
Each iteration takes two consecutive half-steps on the same minibatch and uses the change in parameters and gradients across the first half-step to approximate local curvature. A bias-corrected exponential moving average of these gradient differences yields a curvature vector \(\hat{G}\). The ratio of the squared norm of the parameter change to its inner product with \(\hat{G}\) gives a Barzilai-Borwein-style multiplier \(\gamma\); when the estimated curvature is negative, a fixed fallback value is used instead. The multiplier is clipped to a safe range and applied on top of a slowly decaying base schedule, so only the base learning rate needs careful tuning.
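One way to write the update consistent with this description is given below. It is a reconstruction offered as a sketch, not a quotation from the reference. The index conventions (\(k\) counts iterations, \(g_t\) is the gradient on the minibatch drawn at that iteration), the helper symbols \(\Delta\theta_k\), \(\Delta g_k\), \(G_k\), \(\tilde{\gamma}_{k+1}\), and the initial values \(G_{-1} = 0\), \(\gamma_0 = 1\) are assumptions.

\[
\begin{aligned}
\theta_{k+1/2} &= \theta_k - \frac{\eta\,\gamma_k}{(k+1)^{1/2+\delta}}\, g_t(\theta_k), \\
\theta_{k+1} &= \theta_{k+1/2} - \frac{\eta\,\gamma_k}{(k+1)^{1/2+\delta}}\, g_t(\theta_{k+1/2}), \\
\Delta\theta_k &= \theta_{k+1/2} - \theta_k, \qquad \Delta g_k = g_t(\theta_{k+1/2}) - g_t(\theta_k), \\
G_k &= \beta\, G_{k-1} + (1-\beta)\,\Delta g_k, \qquad \hat{G}_k = \frac{G_k}{1-\beta^{k+1}}, \\
\tilde{\gamma}_{k+1} &=
\begin{cases}
\dfrac{\|\Delta\theta_k\|^2}{\langle \hat{G}_k, \Delta\theta_k \rangle} & \text{if } \langle \hat{G}_k, \Delta\theta_k \rangle > 0, \\
\nu & \text{otherwise,}
\end{cases} \\
\gamma_{k+1} &= \min\bigl(\max(\tilde{\gamma}_{k+1}, \tilde{m}),\, \tilde{M}\bigr)
\end{aligned}
\]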
In this notation, \(\theta\) denotes the parameters, \(\eta\) the base learning rate, \(g_t(\cdot)\) the minibatch gradient, \(\delta \in (0, 1/2)\) the decay exponent of the base schedule, \(\gamma_k\) the tuned step multiplier, \(\hat{G}_k\) the bias-corrected EMA of gradient differences with decay \(\beta\), \(\nu\) the fallback value used under negative curvature, and \(\tilde{m}, \tilde{M}\) the lower and upper clipping bounds.
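The steps described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation. The function name `step_tuned_sgd`, the `grad_fn(theta, batch)` callback, the `batches` iterable, and the default values are all hypothetical choices made for this example, and details such as normalization may differ in the reference code.

```python
import numpy as np


def step_tuned_sgd(grad_fn, theta0, batches, eta=0.1, nu=2.0, beta=0.9,
                   m_tilde=0.5, M_tilde=2.0, delta=0.001):
    """Illustrative sketch; grad_fn(theta, batch) returns a minibatch gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)  # EMA of gradient differences (assumed to start at 0)
    gamma = 1.0               # tuned multiplier (assumed to start at 1)
    gammas = []
    for k, batch in enumerate(batches):
        # Slowly decaying base schedule, scaled by the current multiplier.
        step = gamma * eta / (k + 1) ** (0.5 + delta)

        # Two half-steps that reuse the same minibatch.
        g0 = grad_fn(theta, batch)
        theta_half = theta - step * g0
        g1 = grad_fn(theta_half, batch)
        theta_next = theta_half - step * g1

        # Parameter and gradient changes across the first half-step.
        d_theta = theta_half - theta
        d_grad = g1 - g0

        # Bias-corrected EMA of gradient differences.
        G = beta * G + (1.0 - beta) * d_grad
        G_hat = G / (1.0 - beta ** (k + 1))

        # Barzilai-Borwein-style ratio, with a fallback for non-positive curvature.
        curvature = float(np.dot(G_hat, d_theta))
        if curvature > 0.0:
            gamma = float(np.dot(d_theta, d_theta)) / curvature
        else:
            gamma = nu
        gamma = min(max(gamma, m_tilde), M_tilde)  # clip to [m_tilde, M_tilde]

        gammas.append(gamma)
        theta = theta_next
    return theta, gammas


# Usage on a toy quadratic 0.5 * sum(A * theta**2); the batch argument is ignored.
A = np.array([1.0, 2.0])


def quad_grad(theta, batch):
    return A * theta


theta, gammas = step_tuned_sgd(quad_grad, [1.0, 1.0], batches=range(50))
```

Each iteration costs two gradient evaluations, and both must use the same minibatch for the gradient difference to reflect curvature and not sampling noise.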
Reference: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels, "Second-order step-size tuning of SGD for non-convex optimization", arXiv 2021. https://arxiv.org/abs/2103.03570