AdamD
Implements AdamD, Adam with bias correction applied only to the second moment.
AdamD keeps Adam's exponential moving averages of the gradient and squared gradient but drops the first-moment bias-correction term \(1-\beta_1^t\). Only the second-moment correction \(\sqrt{1-\beta_2^t}\) is kept, folded into the step size. Because the uncorrected first moment \(m_t\) starts at zero, it equals Adam's bias-corrected estimate scaled by \(1-\beta_1^t\), so the effective step size starts small and grows monotonically toward the base learning rate over the first steps of training. This behaves like a built-in warmup and can reduce the need for an explicit learning-rate warmup schedule.
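Written out, the update described above takes the following form. This is a reconstruction from the description rather than a quotation of the reference, and it assumes zero-initialized moments (\(m_0 = v_0 = 0\)) and \(\epsilon\) added outside the square root, as in Adam:

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \eta \sqrt{1-\beta_2^t}\; \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
\]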
In the update rule, \(\theta\) denotes the parameters, \(\eta\) the base learning rate, \(g_t\) the gradient at step \(t\), \(m_t\) and \(v_t\) the first- and second-moment estimates, \(\beta_1\) and \(\beta_2\) their decay rates, and \(\epsilon\) a small constant for numerical stability. Note that no \(1-\beta_1^t\) correction is applied to \(m_t\).
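The behavior of the update can be illustrated with a minimal scalar sketch in plain Python. It is not this library's implementation: the names `adamd_step` and `constant_gradient_steps` are hypothetical, the default hyperparameters are the usual Adam defaults assumed for illustration, and \(\epsilon\) is assumed to be added outside the square root.

```python
import math


def adamd_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamD-style update on a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Only the second-moment correction is folded into the step size;
    # there is deliberately no division by (1 - beta1 ** t).
    step_size = lr * math.sqrt(1.0 - beta2 ** t)
    theta = theta - step_size * m / (math.sqrt(v) + eps)
    return theta, m, v


def constant_gradient_steps(grad=2.0, n_steps=5, lr=1e-3):
    """Return the size of each parameter change under a constant gradient."""
    theta, m, v = 1.0, 0.0, 0.0  # moments start at zero (assumption)
    sizes = []
    for t in range(1, n_steps + 1):
        new_theta, m, v = adamd_step(theta, grad, m, v, t, lr=lr)
        sizes.append(theta - new_theta)
        theta = new_theta
    return sizes


if __name__ == "__main__":
    for t, size in enumerate(constant_gradient_steps(), start=1):
        print(t, size)
```

With a constant gradient, each change is approximately \(\eta\,(1-\beta_1^t)\), so the printed sizes increase from step to step and approach \(\eta\). A real optimizer would apply the same arithmetic elementwise to parameter tensors.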
Reference: John St John, "AdamD: Improved bias-correction in Adam", 2021. https://arxiv.org/abs/2110.10828