Yogi¶
Implements Yogi, an adaptive method that controls the increase in the effective learning rate.
Yogi replaces the multiplicative second-moment update of Adam with an additive, sign-based one, so that the second moment can decrease as well as increase and large gradients do not cause it to grow uncontrollably.
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= v_{t-1} - (1 - \beta_2)
\mathrm{sign}(v_{t-1} - g_t^2) \, g_t^2 \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \frac{\eta}{1 - \beta_1^t}
\frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]
Note: Following the upstream implementation, both the first moment \(m_0\) and second moment \(v_0\) are initialized to initial_accumulator (default 1e-6) rather than to zero.
Reference: Manzil Zaheer, Sashank J. Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar, "Adaptive Methods for Nonconvex Optimization", NeurIPS 2018. https://papers.nips.cc/paper_files/paper/2018/hash/90365351ccc7437a1309dc64e4db32a3-Abstract.html