NAdam
Implements Nadam, an Adam variant that folds Nesterov momentum into the first-moment estimate.
Nadam keeps Adam's running averages of the gradient \(m_t\) and the squared gradient \(v_t\), but replaces Adam's bias-corrected first moment with a Nesterov-style lookahead: the update mixes the freshly updated moment \(m_t\) with the current gradient \(g_t\), so the step anticipates where the momentum is heading rather than relying only on past accumulation. The mixing uses a per-step momentum coefficient \(\mu_t\) that warms up through a schedule governed by the momentum decay \(\psi\). The running product \(\prod_i \mu_i\) takes the place of Adam's \(1 - \beta_1^t\) bias correction for the first-moment terms, while the second moment keeps the usual \(1 - \beta_2^t\) correction.
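Written out, the description above corresponds to an update of roughly the following form. This is a sketch rather than a normative statement: the constant \(0.96\) in the momentum schedule follows the usual formulation of Nadam and is not stated in the text here, and weight decay is omitted.

\[
\begin{aligned}
\mu_t &= \beta_1 \left(1 - \tfrac{1}{2} \cdot 0.96^{\,t \psi}\right), \qquad
\mu_{t+1} = \beta_1 \left(1 - \tfrac{1}{2} \cdot 0.96^{\,(t+1) \psi}\right) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{\mu_{t+1}\, m_t}{1 - \prod_{i=1}^{t+1} \mu_i} + \frac{(1 - \mu_t)\, g_t}{1 - \prod_{i=1}^{t} \mu_i} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \gamma\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]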
In this notation, \(\theta\) denotes the parameters, \(\gamma\) the learning rate, \(g_t\) the gradient at step \(t\), \(m_t\) and \(v_t\) the first and second moment estimates, \(\beta_1, \beta_2\) their decay rates, \(\mu_t\) the scheduled momentum coefficient set by the momentum decay \(\psi\), and \(\epsilon\) a small constant added for numerical stability.
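To make the roles of these symbols concrete, here is a minimal scalar sketch of one Nadam step in plain Python. It is illustrative only and is not the library's implementation: the helper names `nadam_step`, `new_state`, and `minimize_quadratic` are hypothetical, the default hyperparameter values and the constant `0.96` in the schedule are assumptions, and weight decay is left out.

```python
import math


def new_state():
    """Fresh optimizer state: step count, both moments, running product of mu_i."""
    return {"t": 0, "m": 0.0, "v": 0.0, "mu_product": 1.0}


def nadam_step(theta, grad, state, lr=2e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, momentum_decay=4e-3):
    """Apply one scalar Nadam update and return the new parameter value.

    Default values are assumptions chosen for illustration.
    """
    state["t"] += 1
    t = state["t"]

    # Scheduled momentum coefficients mu_t and mu_{t+1} (0.96 is an assumed constant).
    mu = beta1 * (1.0 - 0.5 * 0.96 ** (t * momentum_decay))
    mu_next = beta1 * (1.0 - 0.5 * 0.96 ** ((t + 1) * momentum_decay))

    # Running product of mu_i up to step t, and the same product extended to t + 1.
    state["mu_product"] *= mu
    mu_product = state["mu_product"]
    mu_product_next = mu_product * mu_next

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad

    # Nesterov-style mix of the updated moment and the current gradient,
    # each bias-corrected with its own running product of mu_i.
    m_hat = (mu_next * state["m"] / (1.0 - mu_product_next)
             + (1.0 - mu) * grad / (1.0 - mu_product))
    v_hat = state["v"] / (1.0 - beta2 ** t)

    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize_quadratic(steps=2000, x0=1.0):
    """Run the sketch on f(x) = x**2, whose gradient is 2 * x."""
    x, state = x0, new_state()
    for _ in range(steps):
        x = nadam_step(x, 2.0 * x, state)
    return x
```

Calling `minimize_quadratic()` drives `x` from `1.0` toward the minimizer at `0`. Keeping the running product in the state avoids recomputing \(\prod_i \mu_i\) from scratch at every step.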
Reference: Timothy Dozat, "Incorporating Nesterov Momentum into Adam", ICLR Workshop 2016. https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ