Anon
Implements Anon, a unified adaptive optimizer with a tunable adaptivity exponent that interpolates between SGD and Adam and extrapolates beyond both.
Anon keeps the usual first and second moments but raises the second moment to a tunable power \(\gamma\) before forming the preconditioner, so \(\gamma\) continuously controls how adaptive the step is: \(\gamma \approx 0\) gives SGD-like behavior, \(\gamma \approx 1\) gives Adam-like behavior, and values outside that range extrapolate beyond both. The preconditioner is refreshed only at logarithmically spaced steps through an Infrequent Decoupled Update: at each refresh, the accumulated second moment is combined with the previous preconditioner via a harmonic mean to form the new one, and the accumulator is then reset. Between refreshes the same preconditioner is reused, which decouples the adaptation cadence from the per-step update and stabilizes the geometry.
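The mechanics described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function names, the `state` dictionary, the initial preconditioner of ones, the exponential-moving-average accumulator, and the form `(sqrt(s) + eps) ** gamma` for the refreshed preconditioner are all hypothetical choices made so that \(\gamma = 0\) reduces to a momentum-SGD-like step and \(\gamma = 1\) to an Adam-like denominator. Bias correction and the projection step are omitted.

```python
import numpy as np


def is_refresh_step(t: int) -> bool:
    """True when t is a power of two (t = 2, 4, 8, ...).

    Assumption: refreshes start at t = 2, matching k >= 0 in k + 1 = log2(t).
    """
    return t >= 2 and (t & (t - 1)) == 0


def harmonic_mean(a, b):
    """Elementwise harmonic mean of two strictly positive arrays."""
    return 2.0 * a * b / (a + b)


def init_state(theta):
    """Hypothetical optimizer state: moments at zero, preconditioner at one."""
    return {
        "m": np.zeros_like(theta),  # first moment
        "s": np.zeros_like(theta),  # second-moment accumulator
        "v": np.ones_like(theta),   # preconditioner reused between refreshes
    }


def anon_like_step(theta, grad, state, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   gamma=1.0, eps=1e-8):
    """One illustrative step; returns the updated parameters."""
    # Moments are updated on every step.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["s"] = beta2 * state["s"] + (1.0 - beta2) * grad**2

    # Infrequent refresh: only at logarithmically spaced steps.
    if is_refresh_step(t):
        # Assumed form of the new preconditioner candidate.
        candidate = (np.sqrt(state["s"]) + eps) ** gamma
        state["v"] = harmonic_mean(state["v"], candidate)
        state["s"] = np.zeros_like(state["s"])  # reset the accumulator

    # The same preconditioner is reused until the next refresh.
    return theta - lr * state["m"] / state["v"]
```

With `gamma=0.0` the candidate is identically one, so the preconditioner never moves away from its initial value and each step is a plain momentum update. Larger `gamma` makes the step depend more strongly on the accumulated gradient scale.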
Here \(\theta\) denotes the parameters, \(\eta(t)\) the (possibly scheduled) learning rate, \(g_t\) the gradient, \(m_t\) and \(s_t\) the first and second moments with decay rates \(\beta_1\) and \(\beta_2\), \(\epsilon\) a stability constant, \(\gamma\) the adaptivity exponent, \(v_k\) the harmonic-mean preconditioner, refreshed only when \(k+1=\log_2 t\) (that is, at steps \(t = 2^{k+1}\)), \(V_k=\mathrm{diag}(v_k)\), and \(\Pi_{\mathcal{F},V_k^{-1}}\) the projection onto the feasible set \(\mathcal{F}\) in the \(V_k^{-1}\) metric.
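To make the projection operator concrete, consider the special case where \(\mathcal{F}\) is a box (elementwise lower and upper bounds). Because the metric is diagonal, the weighted squared distance separates across coordinates, so the projection reduces to elementwise clipping regardless of the positive weights. The sketch below assumes this box-shaped feasible set; the function name and arguments are hypothetical, and other feasible sets generally require a different projection.

```python
import numpy as np


def project_box_diag_metric(y, lower, upper, metric_diag):
    """Project y onto the box [lower, upper] under a diagonal metric.

    Minimizes sum_i metric_diag[i] * (x[i] - y[i])**2 subject to
    lower <= x <= upper. Each coordinate is minimized independently,
    so the positive weights do not change the minimizer.
    """
    metric_diag = np.asarray(metric_diag, dtype=float)
    if np.any(metric_diag <= 0):
        raise ValueError("metric_diag must be strictly positive")
    return np.clip(y, lower, upper)


y = np.array([2.0, -3.0, 0.25])
projected = project_box_diag_metric(y, -1.0, 1.0, np.array([0.1, 10.0, 1.0]))
```

In the unconstrained case, where \(\mathcal{F}\) is the whole parameter space, the projection is the identity and can be dropped.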
Reference: Yiheng Zhang, Kaiyan Zhao, Shaowu Wu, Yiming Wang, Jiajun Wu, Leong Hou U, Steve Drew, Xiaoguang Niu, "Anon: Extrapolating Adaptivity Beyond SGD and Adam", ICML 2025. https://arxiv.org/abs/2605.02317