AdamW
Implements AdamW, Adam with weight decay decoupled from the gradient update.
In standard Adam, L2 regularization is folded into the gradient, so the weight decay is rescaled by the per-coordinate adaptive learning rate and its effect on parameters with large second moments is suppressed. AdamW removes the decay term from the gradient and instead subtracts \(\eta \lambda \theta_{t-1}\) directly from the parameters at each step. The adaptive moment estimates \(m_t\) and \(v_t\) are therefore computed from the raw gradient, and the regularization acts uniformly across coordinates.
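Following the formulation in the reference below, and assuming the usual bias-corrected moment estimates (the hatted symbols \(\hat{m}_t\) and \(\hat{v}_t\) are notation introduced here for illustration), one step at iteration \(t\) can be written as:

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \lambda \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]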
Here \(\theta\) denotes the parameters, \(\eta\) the learning rate, \(g_t\) the gradient at step \(t\), \(m_t\) and \(v_t\) the first and second moment estimates, \(\beta_1, \beta_2\) the exponential decay rates of those estimates, \(\lambda\) the weight decay coefficient, and \(\epsilon\) a small constant added to the denominator for numerical stability.
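The update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not this library's implementation: the function name `adamw_step`, its default values, and the `decoupled` flag (which switches to Adam with L2 folded into the gradient, for comparison) are all hypothetical, and bias correction is assumed.

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One optimizer step; returns the new (theta, m, v).

    Hypothetical helper for illustration. `t` is the 1-based step count.
    decoupled=True  -> AdamW: decay is applied directly to the parameters.
    decoupled=False -> Adam with L2: decay is added to the gradient.
    """
    if not decoupled:
        # Coupled L2: the decay term passes through the adaptive scaling.
        grad = grad + weight_decay * theta

    # Moment estimates (from the raw gradient in the decoupled case).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Bias correction (assumed; compensates for zero initialization).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    if decoupled:
        # Decoupled decay: shrink every coordinate by the factor (1 - lr * wd).
        theta = theta - lr * weight_decay * theta

    # Adaptive gradient step.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Usage: with a zero gradient, only the decay term moves the parameters.
theta0 = np.array([1.0, -2.0])
zeros = np.zeros_like(theta0)
theta_w, _, _ = adamw_step(theta0, zeros, zeros, zeros, t=1,
                           lr=0.1, weight_decay=0.1, decoupled=True)
theta_l2, _, _ = adamw_step(theta0, zeros, zeros, zeros, t=1,
                            lr=0.1, weight_decay=0.1, decoupled=False)
```

In this toy call, the decoupled variant scales both coordinates by the same factor \(1 - \eta\lambda\), while the coupled variant moves each coordinate by roughly \(\eta\) regardless of its magnitude, because the decay term is normalized by \(\sqrt{\hat{v}_t}\) along with the rest of the gradient.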
Reference: Ilya Loshchilov and Frank Hutter, "Decoupled Weight Decay Regularization", ICLR 2019. https://arxiv.org/abs/1711.05101