μ²-SGD
Implements μ²-SGD, a stable stochastic optimizer built from a double momentum mechanism.
μ²-SGD couples two complementary momentum ideas. The query points \(x_t\) follow an Anytime-style weighted average of the iterates \(w_t\), while the gradient estimate \(d_t\) is a STORM-style variance-reduced (corrected) momentum that reuses the previous query point under the freshly drawn sample. With importance weights \(\alpha_t = t+1\) and decay \(\beta_t = 1/\alpha_t\), the expected squared error of the gradient estimate shrinks as \(\mathcal{O}(1/t)\), allowing a large, near-constant effective step.
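Each step updates the iterate and the query point, then draws a fresh sample \(z_{t+1}\), evaluates the gradient at both the new query point \(x_{t+1}\) and the previous query point \(x_t\), and combines them:
\[
\begin{aligned}
w_{t+1} &= \Pi_{\mathcal{K}}\left(w_t - \eta\,\alpha_t\,d_t\right), \\
x_{t+1} &= \frac{\alpha_{1:t}}{\alpha_{1:t+1}}\,x_t + \frac{\alpha_{t+1}}{\alpha_{1:t+1}}\,w_{t+1}, \\
g_{t+1} &= \nabla f(x_{t+1}; z_{t+1}), \qquad \bar{g}_t = \nabla f(x_t; z_{t+1}), \\
d_{t+1} &= g_{t+1} + (1-\beta_{t+1})\,(d_t - \bar{g}_t),
\end{aligned}
\]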
where \(\theta\) is identified with the iterate \(w_t\), \(x_t\) is the averaged query point, \(d_t\) the corrected momentum gradient estimate, \(\eta\) the learning rate, \(\alpha_t = t+1\) the importance weights with \(\alpha_{1:t} = \sum_{\tau=1}^{t}\alpha_\tau\), \(\beta_t = 1/\alpha_t\) the momentum decay, \(\Pi_{\mathcal{K}}\) the projection onto the feasible set, and \(\bar{g}_t\) the gradient at the old query point under the new sample.
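The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not this library's implementation: the function name `mu2_sgd` and the callbacks `grad`, `sample`, and `project` are hypothetical, the initialization \(w_1 = x_1\), \(d_1 = \nabla f(x_1; z_1)\) is assumed, and the toy objective \(f(x; z) = \tfrac{1}{2}\lVert x - z\rVert^2\) and the step size are chosen only for the demonstration.

```python
import numpy as np


def mu2_sgd(grad, x1, eta, steps, sample, project=lambda w: w):
    """Sketch of the mu^2-SGD loop with alpha_t = t + 1, beta_t = 1 / alpha_t.

    All argument names are hypothetical, chosen for this illustration:
      grad(x, z)  -> stochastic gradient at point x under sample z
      sample()    -> draws a fresh sample z
      project(w)  -> projection onto the feasible set (identity by default)
    """
    w = np.array(x1, dtype=float)  # iterate w_1
    x = w.copy()                   # query point x_1 = w_1 (assumed init)
    d = grad(x, sample())          # d_1 = gradient at x_1 under z_1
    alpha_sum = 2.0                # alpha_{1:1} = alpha_1 = 2

    for t in range(1, steps + 1):
        alpha_t = t + 1.0
        alpha_next = t + 2.0       # alpha_{t+1}

        # Projected step on the iterate, scaled by the importance weight.
        w = project(w - eta * alpha_t * d)

        # Anytime averaging: x_{t+1} is the alpha-weighted mean of w_1..w_{t+1}.
        x_prev = x
        new_sum = alpha_sum + alpha_next
        x = (alpha_sum * x_prev + alpha_next * w) / new_sum
        alpha_sum = new_sum

        # One fresh sample, two gradient evaluations (new and old query point).
        z = sample()
        g_new = grad(x, z)
        g_old = grad(x_prev, z)

        # Corrected momentum with beta_{t+1} = 1 / alpha_{t+1}.
        beta_next = 1.0 / alpha_next
        d = g_new + (1.0 - beta_next) * (d - g_old)

    return x


# Toy usage (assumed problem): f(x; z) = 0.5 * ||x - z||^2 with z ~ N(target, 0.1^2 I),
# so the stochastic gradient is x - z and the minimizer is `target`.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
steps = 400
x_hat = mu2_sgd(
    grad=lambda x, z: x - z,
    x1=np.zeros(2),
    eta=1.0 / (8 * steps),  # small step picked for this toy problem only
    steps=steps,
    sample=lambda: target + 0.1 * rng.standard_normal(2),
)
```

The sketch returns the last query point \(x_t\) rather than the iterate \(w_t\), since the gradients are evaluated at the query points. Passing a different `project` callback (for example, clipping to a box) gives the constrained variant.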
Reference: Kfir Y. Levy, "μ²-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism", arXiv 2023. https://arxiv.org/abs/2304.04172