Stable-SPAM / GradientStabilizer
Implements Stable-SPAM, an Adam variant that stabilizes low-precision training by adaptively clipping and normalizing gradients before the moment update.
Stable-SPAM augments Adam with three mechanisms. AdaClip detects spike entries whose magnitude exceeds an exponentially tracked, bias-corrected threshold and rescales them down to that threshold. AdaGN (the GradientStabilizer) normalizes the whole gradient by its current norm and reweights it by a bias-corrected ratio of the first moment to the square root of the second moment of past gradient norms, damping bursts that would otherwise destabilize 4-bit and BF16 optimizer states. Periodic momentum reset zeroes the moments \(m,v\) every \(\Delta T\) steps, which keeps stale moments from amplifying instability. The cleaned gradient then drives a standard Adam update.
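The description above can be written out as the following update. This is a reconstruction from the prose and the symbols used on this page, not a quotation of the reference. It assumes that \(T_t\) tracks the largest absolute gradient entry, that spike entries are rescaled by \(\bar T_t / \max_j |g_{t,j}|\) (so the largest entry lands exactly on the threshold), that \(n_t\) is the norm of the clipped gradient, and that the Adam bias correction counts the steps \(k\) since the last reset. The bars marking bias-corrected quantities and the counter \(k\) are notation introduced here.

\[
\begin{aligned}
T_t &= \gamma_3 T_{t-1} + (1-\gamma_3)\max_j |g_{t,j}|, &
\bar T_t &= \frac{T_t}{1-\gamma_3^{t}}, \\
\tilde g_{t,i} &=
\begin{cases}
g_{t,i}\,\dfrac{\bar T_t}{\max_j |g_{t,j}|} & \text{if } |g_{t,i}| > \bar T_t,\\[4pt]
g_{t,i} & \text{otherwise,}
\end{cases} &
n_t &= \lVert \tilde g_t \rVert_2, \\
\mu_t &= \gamma_1 \mu_{t-1} + (1-\gamma_1)\, n_t, &
\bar\mu_t &= \frac{\mu_t}{1-\gamma_1^{t}}, \\
\nu_t &= \gamma_2 \nu_{t-1} + (1-\gamma_2)\, n_t^{2}, &
\bar\nu_t &= \frac{\nu_t}{1-\gamma_2^{t}}, \\
\hat g_t &= \frac{\tilde g_t}{n_t}\cdot\frac{\bar\mu_t}{\sqrt{\bar\nu_t}+\epsilon}, \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\hat g_t, &
\bar m_t &= \frac{m_t}{1-\beta_1^{k}}, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,\hat g_t^{2}, &
\bar v_t &= \frac{v_t}{1-\beta_2^{k}}, \\
\theta_t &= \theta_{t-1} - \eta\,\frac{\bar m_t}{\sqrt{\bar v_t}+\epsilon}.
\end{aligned}
\]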
Notation: \(\theta\) are the parameters and \(\eta\) is the learning rate; \(g_t\) is the gradient, \(\tilde g_t\) its spike-clipped version, and \(\hat g_t\) the further normalized gradient; \(T_t\) is the tracked spike threshold (\(\gamma_3 = 0.999\)); \(\mu_t,\nu_t\) are the first and second moments of the gradient norm \(n_t\), with \(\gamma_1,\gamma_2\) controlling their decay; \(m_t,v_t\) are the Adam moments with decays \(\beta_1,\beta_2\); \(\epsilon\) is a stability constant; and \(\Delta T\) is the number of steps between resets of the moments \(m,v\) to zero.
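To make the three mechanisms and the final Adam step concrete, here is a minimal NumPy sketch for a single parameter array. It is an illustration under stated assumptions, not this library's implementation: the class name `StableSpamSketch` and its method names are hypothetical, only \(\gamma_3 = 0.999\) comes from the text (the other defaults are placeholders), the threshold is assumed to track the largest absolute gradient entry, and the Adam bias correction is assumed to restart after each momentum reset.

```python
import numpy as np


class StableSpamSketch:
    """Single-array sketch of the update described above; not a library API."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, gamma1=0.7,
                 gamma2=0.9, gamma3=0.999, eps=1e-8, reset_interval=1000):
        # Only gamma3 = 0.999 comes from the text; other defaults are illustrative.
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.gamma1, self.gamma2, self.gamma3 = gamma1, gamma2, gamma3
        self.eps, self.reset_interval = eps, reset_interval
        self.t = 0        # global step: clock for threshold and norm statistics
        self.k = 0        # steps since last reset: assumed Adam bias-correction clock
        self.T = 0.0      # tracked spike threshold T_t
        self.T_hat = 0.0  # bias-corrected threshold, kept for inspection
        self.mu = 0.0     # first moment of the gradient norm
        self.nu = 0.0     # second moment of the gradient norm
        self.m = None     # Adam first moment
        self.v = None     # Adam second moment

    def adaclip(self, g):
        """Rescale entries above the bias-corrected threshold (assumed form).

        Expects self.t >= 1, so call it through clean() rather than directly.
        """
        g_max = np.max(np.abs(g))
        self.T = self.gamma3 * self.T + (1 - self.gamma3) * g_max
        self.T_hat = self.T / (1 - self.gamma3 ** self.t)
        if g_max == 0:
            return g
        spikes = np.abs(g) > self.T_hat
        # The largest entry lands on T_hat; other spikes shrink proportionally.
        return np.where(spikes, g * (self.T_hat / g_max), g)

    def adagn(self, g):
        """Normalize by the current norm, reweight by tracked norm statistics."""
        n = np.linalg.norm(g)
        self.mu = self.gamma1 * self.mu + (1 - self.gamma1) * n
        self.nu = self.gamma2 * self.nu + (1 - self.gamma2) * n ** 2
        if n == 0:
            return g
        mu_hat = self.mu / (1 - self.gamma1 ** self.t)
        nu_hat = self.nu / (1 - self.gamma2 ** self.t)
        return g / n * mu_hat / (np.sqrt(nu_hat) + self.eps)

    def clean(self, g):
        """AdaClip followed by AdaGN; returns the cleaned gradient."""
        self.t += 1
        g = np.asarray(g, dtype=np.float64)
        return self.adagn(self.adaclip(g))

    def step(self, theta, g):
        """One update: clean the gradient, maybe reset moments, then Adam."""
        g_hat = self.clean(g)
        if self.m is None or self.k == self.reset_interval:
            # Periodic momentum reset: zero m, v and restart bias correction.
            self.m = np.zeros_like(g_hat)
            self.v = np.zeros_like(g_hat)
            self.k = 0
        self.k += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g_hat
        self.v = self.beta2 * self.v + (1 - self.beta2) * g_hat ** 2
        m_hat = self.m / (1 - self.beta1 ** self.k)
        v_hat = self.v / (1 - self.beta2 ** self.k)
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

A typical loop would call `theta = opt.step(theta, grad)` once per iteration. Keeping a separate counter `k` for the Adam bias correction is a design choice of this sketch: without it, freshly zeroed moments would be divided by a correction factor close to one, and the first updates after a reset would be mis-scaled.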
Reference: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu, "Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam", ICML 2025. https://arxiv.org/abs/2502.17055