
AdamD

Implements AdamD, Adam with bias correction applied only to the second moment.

AdamD retains Adam's exponential moving averages of the gradient and squared gradient, but drops the first-moment bias-correction term \(1-\beta_1^t\) entirely, keeping only the well-justified second-moment correction \(\sqrt{1-\beta_2^t}\), which is folded into the step size. Because \(m_t\) starts at zero and is left uncorrected, early updates are scaled down by roughly \(1-\beta_1^t\): for a near-constant gradient the effective step size grows smoothly from about \((1-\beta_1)\eta\) toward \(\eta\). The result is conservative steps during the first iterations of training, which behaves like a short built-in warmup and may reduce the need for a separate learning-rate warmup schedule.

\[ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \eta_t &= \eta \sqrt{1-\beta_2^t} \\ \theta_t &= \theta_{t-1} - \eta_t \frac{m_t}{\sqrt{v_t} + \epsilon} \end{aligned} \]

where \(\theta\) are the parameters, \(\eta\) the base learning rate, \(g_t\) the gradient at step \(t = 1, 2, \dots\), \(m_t\) and \(v_t\) the first- and second-moment estimates (initialized to zero, as in Adam), \(\beta_1,\beta_2\) the decay rates, and \(\epsilon\) a stability constant. Note the absence of any \(1-\beta_1^t\) correction on \(m_t\).
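The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adamd_step`, its argument order, and the default hyperparameters (borrowed from common Adam defaults) are assumptions made for the example.

```python
import numpy as np


def adamd_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamD update. `t` is the 1-based step count (hypothetical helper)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment, left uncorrected
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment
    lr_t = lr * np.sqrt(1.0 - beta2 ** t)       # second-moment correction folded into the step size
    theta = theta - lr_t * m / (np.sqrt(v) + eps)
    return theta, m, v


# Demo with a constant gradient: record the size of each step on one coordinate.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)   # moments start at zero
v = np.zeros_like(theta)
grad = np.array([0.5, -0.5])

step_sizes = []
for t in range(1, 6):
    new_theta, m, v = adamd_step(theta, grad, m, v, t)
    step_sizes.append(float(np.abs(new_theta - theta)[0]))
    theta = new_theta
```

With a constant gradient and \(\epsilon\) negligible, \(m_t = (1-\beta_1^t)\,g\) and \(\sqrt{v_t} = \sqrt{1-\beta_2^t}\,|g|\), so the \(\sqrt{1-\beta_2^t}\) factors cancel and each step has magnitude approximately \(\eta\,(1-\beta_1^t)\). In the demo, `step_sizes` therefore grows as roughly \(0.1\eta, 0.19\eta, 0.271\eta, \dots\) rather than starting at the full \(\eta\).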

Reference: John St John, "AdamD: Improved bias-correction in Adam", 2021. https://arxiv.org/abs/2110.10828

