M-SVAG
Implements M-SVAG, a momentum optimizer that scales each coordinate of the update by an estimate of its gradient signal-to-noise ratio.
M-SVAG decouples the two effects fused inside Adam: a sign-based direction and a per-coordinate variance adaptation. Instead of dividing by \(\sqrt{v_t}\), it keeps the momentum direction \(m_t\) and multiplies it by a factor \(\gamma_t \in [0, 1]\). This factor shrinks the step on coordinates whose stochastic gradient variance is large relative to the squared mean, and leaves low-noise coordinates near full step size. The variance is estimated from the same exponential moving averages of \(g_t\) and \(g_t^2\), with a bias correction \(\rho(\beta_1, t)\) that accounts for \(m_t\) being itself an average of noisy gradients, so that \(m_t^2\) overestimates the squared mean gradient.
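Written out, the update takes roughly the following form. This is a reconstruction based on the cited paper, not text recovered from this page: the raw moving averages \(\tilde{m}_t\), \(\tilde{v}_t\) and the parameter vector \(\theta_t\) are notation introduced here, and the step counter is assumed to start at \(t = 1\).

\[
\begin{aligned}
\tilde{m}_t &= \beta_1 \tilde{m}_{t-1} + (1 - \beta_1)\, g_t \\
\tilde{v}_t &= \beta_1 \tilde{v}_{t-1} + (1 - \beta_1)\, g_t^2 \\
m_t &= \frac{\tilde{m}_t}{1 - \beta_1^t}, \qquad v_t = \frac{\tilde{v}_t}{1 - \beta_1^t} \\
\rho(\beta_1, t) &= \frac{(1 - \beta_1)(1 + \beta_1^t)}{(1 + \beta_1)(1 - \beta_1^t)} \\
\hat{s}_t &= \frac{v_t - m_t^2}{1 - \rho(\beta_1, t)} \\
\gamma_t &= \frac{m_t^2}{m_t^2 + \rho(\beta_1, t)\, \hat{s}_t} \\
\theta_{t+1} &= \theta_t - \eta\, (\gamma_t \odot m_t)
\end{aligned}
\]

All squares and divisions are elementwise. Under this indexing \(\rho(\beta_1, 1) = 1\), so \(\hat{s}_1\) is undefined and an implementation has to treat the first step specially.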
Notation: \(m_t\) is the bias-corrected first moment, \(v_t\) the bias-corrected second moment, \(\hat{s}_t\) an unbiased estimate of the gradient variance, \(\rho(\beta_1, t)\) the bias-correction term tying the two estimates together, \(\gamma_t\) the per-coordinate variance-adaptation factor, \(\eta\) the learning rate, \(\beta_1\) the decay rate shared by both moving averages, and \(\odot\) elementwise multiplication.
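To make the roles of these quantities concrete, here is a minimal NumPy sketch of a single M-SVAG step. It is an illustration under stated assumptions, not the implementation documented on this page. The function names `rho` and `msvag_step` and the `state` dictionary are hypothetical, and the closed form of \(\rho(\beta_1, t)\) is taken from the cited paper with a step counter starting at 1. The first step, where \(\rho = 1\) and the variance estimate is undefined, is handled by setting \(\gamma_t = 0\); that choice is an assumption of this sketch.

```python
import numpy as np


def rho(beta: float, t: int) -> float:
    """Correction term rho(beta, t) for t >= 1 (closed form assumed from the paper)."""
    return (1.0 - beta) * (1.0 + beta**t) / ((1.0 + beta) * (1.0 - beta**t))


def msvag_step(theta, grad, state, lr=0.1, beta=0.9):
    """Apply one M-SVAG update; returns (new_theta, gamma).

    `state` is a dict with keys "t" (int), "m" and "v" (arrays shaped like theta).
    """
    state["t"] += 1
    t = state["t"]

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta * state["m"] + (1.0 - beta) * grad
    state["v"] = beta * state["v"] + (1.0 - beta) * grad**2

    # Bias-corrected moments.
    m = state["m"] / (1.0 - beta**t)
    v = state["v"] / (1.0 - beta**t)

    r = rho(beta, t)
    gamma = np.zeros_like(m)
    if t > 1:  # at t == 1, rho == 1 and the variance estimate is 0/0 (assumed: gamma = 0)
        # Variance estimate; the clamp only guards against floating-point round-off.
        s = np.maximum(v - m**2, 0.0) / (1.0 - r)
        denom = m**2 + r * s
        # gamma = m^2 / (m^2 + rho * s); entries with a zero denominator stay at 0.
        np.divide(m**2, denom, out=gamma, where=denom > 0)

    return theta - lr * gamma * m, gamma
```

Starting from `state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}`, repeated calls with a constant gradient drive `gamma` to 1 (no noise, full momentum step), while gradients that flip sign from step to step push it toward 0.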
Reference: Lukas Balles, Philipp Hennig, "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients", ICML 2018. https://arxiv.org/abs/1705.07774