AdamS
Implements AdamS, an Adam variant that normalizes by the momentum itself instead of a separate second-moment estimate.
AdamS keeps Adam's exponential momentum but replaces the squared-gradient running average in the denominator with a blend of the squared previous momentum and the squared current gradient. This eliminates the second-moment state entirely, matching the memory footprint of SGD with momentum while retaining adaptive per-coordinate scaling. The denominator \(\beta_2 m_{t-1}^2 + (1-\beta_2) g_t^2\) uses the previous momentum \(m_{t-1}\) as a low-variance stand-in for the gradient scale. No bias correction is applied, and weight decay is decoupled.
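The update implied by this description can be written out as below. This is a reconstruction from the prose rather than a quotation of the reference: placing \(\epsilon\) outside the square root and scaling the weight-decay term by \(\eta\) follow the usual AdamW convention and are assumptions here.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 m_{t-1}^2 + (1-\beta_2)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
\]

Note that \(v_t\) is recomputed from \(m_{t-1}\) and \(g_t\) at every step, so only \(m_t\) has to be carried between steps.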
In this notation, \(\theta\) are the parameters, \(\eta\) the learning rate, \(g_t\) the gradient, \(m_t\) the first moment, \(v_t = \beta_2 m_{t-1}^2 + (1-\beta_2) g_t^2\) the momentum-based normalizer, \(\beta_1, \beta_2\) the decay rates, \(\lambda\) the weight decay, and \(\epsilon\) a stability constant. All squares, square roots, and divisions act element-wise.
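To make the update concrete, here is a minimal NumPy sketch of a single AdamS step. The function name `adams_step`, its signature, and the default hyperparameter values are hypothetical and chosen only for illustration; the placement of `eps` outside the square root and the scaling of weight decay by the learning rate are likewise assumptions in the AdamW style, not details confirmed by the reference.

```python
import numpy as np


def adams_step(theta, m_prev, grad, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.0):
    """One AdamS-style update (illustrative sketch, not a reference implementation).

    Returns the new parameters and the new momentum. The momentum is the
    only optimizer state carried between steps.
    """
    # Normalizer: blend of squared previous momentum and squared current gradient.
    v = beta2 * m_prev**2 + (1.0 - beta2) * grad**2
    # Exponential momentum, with no bias correction.
    m = beta1 * m_prev + (1.0 - beta1) * grad
    # Adaptive per-coordinate step (eps outside the sqrt is an assumption).
    step = m / (np.sqrt(v) + eps)
    # Decoupled weight decay: applied to the parameters, not added to the gradient.
    theta_new = theta - lr * (step + weight_decay * theta)
    return theta_new, m


# Toy usage: a few steps on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
for _ in range(5):
    theta, m = adams_step(theta, m, grad=theta, lr=0.1)
```

Because `v` is rebuilt from `m_prev` and `grad` inside each call, the function returns only `m` as state, which reflects the single-buffer memory footprint described above.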
Reference: Huishuai Zhang, Bohan Wang, Luoxin Chen, "AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training", 2025. https://arxiv.org/abs/2505.16363