Adam+¶
Implements Adam+, a stochastic method that adds adaptive variance reduction on top of Adam-style momentum.
Adam+ keeps a single moving average \(z_t\) of the gradient (a first-moment estimate) but evaluates the gradient at an extrapolated point \(\hat\theta_{t+1}\) rather than the current iterate. This extrapolation acts as a variance-reduction mechanism, while the step size is normalized by the square root of the norm of \(z_t\), replacing Adam's coordinate-wise second moment with a single adaptive scale.
where \(\theta\) are the parameters, \(z_t\) is the first-moment estimate (initialized as \(z_0 = g_0(\theta_0)\)), \(g_{t+1}(\hat\theta_{t+1})\) is the stochastic gradient evaluated at the extrapolated point, \(\alpha \ge 1\) is the step-size parameter, \(\beta \in (0,1)\) is the moment decay rate, \(a \ge 1\) is the exponent applied to \(\beta\), and \(\epsilon\) is a small stability constant (defaults \(\alpha=0.1\), \(a=1\), \(\beta=0.1\), \(\epsilon=10^{-8}\)).
Reference: Mingrui Liu, Wei Zhang, Francesco Orabona, Tianbao Yang, "Adam+: A Stochastic Method with Adaptive Variance Reduction", arXiv 2020. https://arxiv.org/abs/2011.11985