AdaMod
Implements AdaMod, an Adam variant that bounds the per-parameter learning rates by an exponential moving average of their past values.
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\eta_t &= \frac{\alpha \sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)
(\sqrt{v_t} + \epsilon)} \\
s_t &= \beta_3 s_{t-1} + (1 - \beta_3) \eta_t \\
\hat{\eta}_t &= \min(\eta_t, s_t) \\
\theta_t &= \theta_{t-1} - \hat{\eta}_t \odot m_t
\end{aligned}
\]
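Here \(g_t\) is the gradient at step \(t\), \(\alpha\) is the base learning rate, and all operations are element-wise. The adaptive learning rate \(\eta_t\) computed by Adam, written with both bias-correction factors folded into it, is smoothed by a third exponential moving average \(s_t\) with decay \(\beta_3\). Each element of \(\eta_t\) is then capped at the corresponding element of \(s_t\), the momental bound, giving \(\hat{\eta}_t\). This restrains the unusually large learning rates that can appear early in training.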
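The update rule can be sketched in plain Python to make the order of operations concrete. This is a minimal illustration, not the library's implementation: the function names `adamod_step` and `init_state`, the dictionary-based state, the default hyperparameter values, and the zero initialisation of \(m\), \(v\), and \(s\) are all assumptions made for this example.

```python
import math


def init_state(n):
    """Hypothetical helper: zero-initialised optimizer state for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "s": [0.0] * n}


def adamod_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """Apply one AdaMod update to a flat list of parameters (illustrative sketch).

    Default values are assumptions for this example, not documented defaults.
    """
    state["t"] += 1
    t = state["t"]
    # Both bias corrections are folded into the step size, as in eta_t.
    scale = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # First and second moment estimates (m_t, v_t).
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Adam's per-parameter learning rate (eta_t).
        eta = scale / (math.sqrt(state["v"][i]) + eps)
        # Momental bound (s_t): moving average of the learning rates so far.
        state["s"][i] = beta3 * state["s"][i] + (1.0 - beta3) * eta
        # Cap the learning rate at the bound, then take the step.
        eta_hat = min(eta, state["s"][i])
        new_theta.append(p - eta_hat * state["m"][i])
    return new_theta


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, state = [1.0], init_state(1)
for _ in range(5):
    x = adamod_step(x, [2.0 * x[0]], state, lr=0.1)
```

Because \(s_0 = 0\) in this sketch, the bound starts far below \(\eta_t\), so the earliest steps are much smaller than Adam's would be; the cap loosens as \(s_t\) accumulates.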
Reference: Jianbang Ding, Xuancheng Ren, Ruixuan Luo, Xu Sun, "An Adaptive and Momental Bound Method for Stochastic Learning", 2019. https://arxiv.org/abs/1910.12249