AdaBound¶
Implements AdaBound, Adam with a dynamic bound on the learning rate.
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{\eta}_t &= \mathrm{Clip}\!\left(
\frac{\eta \sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)
(\sqrt{v_t} + \epsilon)},
\eta_l(t), \eta_u(t)\right) \\
\eta_l(t) &= \eta^* \left(1 - \frac{1}{\gamma t + 1}\right),
\quad \eta_u(t) = \eta^* \left(1 + \frac{1}{\gamma t}\right) \\
\theta_t &= \theta_{t-1} - \hat{\eta}_t \odot m_t
\end{aligned}
\]
where \(\eta^*\) is the final (SGD) learning rate. The lower and upper bounds converge to \(\eta^*\) as \(t \to \infty\), so AdaBound transitions smoothly from Adam to SGD.
Reference: Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun, "Adaptive Gradient Methods with Dynamic Bound of Learning Rate", ICLR 2019. https://arxiv.org/abs/1902.09843