Simplified-AdEMAMix¶
Implements Simplified-AdEMAMix, an AdEMAMix variant that keeps a single momentum buffer and adds the raw gradient.
AdEMAMix accelerates training by mixing a fast and a slow exponential moving average of the gradient, but this needs a second momentum buffer and a schedule for its decay rate. Simplified-AdEMAMix drops the slow EMA entirely. It keeps one accumulator \(m_t\) updated without the usual \((1-\beta_1)\) normalization, so \(m_t\) behaves like a slow, large-mass momentum, and recovers the responsiveness of a fast EMA by adding the current gradient \(g_t\) directly into the step with a fixed weight \(\alpha\). The second moment \(v_t\) and its bias correction are inherited from AdamW.
where \(\theta\) are the parameters, \(\eta_t\) the learning rate, \(g_t\) the gradient, \(m_t\) the unnormalized momentum accumulator, \(v_t\) the second moment with bias-corrected \(\hat{v}_t\), \(\beta_1,\beta_2\) the decay rates, \(\alpha\) the fixed weight on the current gradient, \(\lambda\) the decoupled weight decay, and \(\epsilon\) a stability constant. An optional warmup schedule may ramp \(\beta_1\) from \(\beta_{\mathrm{start}}\) over \(T_{\beta_1}\) steps.
Reference: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade, "Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants", 2025. https://arxiv.org/abs/2502.02431