AGD
Implements AGD, an auto-switchable optimizer that builds its preconditioner from the stepwise difference of bias-corrected gradient moments.
AGD builds the preconditioner from \(s_t\), the difference between consecutive bias-corrected first moments, rather than from the raw gradient. A \(\max\) in the denominator gates each coordinate independently: where the accumulated squared difference \(b_t\) is large, the step is adaptive (Adam-like); where it falls below the threshold set by \(\delta\), the update reduces to scaled momentum (SGD-like). The optimizer therefore switches between the two regimes automatically.
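The update rule can be written out as follows. This is a sketch reconstructed from the definitions on this page and the referenced paper, not copied from the implementation; the zero initialization \(m_0 = b_0 = 0\) and the element-wise product \(\odot\) are assumptions of the sketch.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
s_t &= \begin{cases} \dfrac{m_1}{1 - \beta_1}, & t = 1 \\[6pt] \dfrac{m_t}{1 - \beta_1^t} - \dfrac{m_{t-1}}{1 - \beta_1^{t-1}}, & t > 1 \end{cases} \\
b_t &= \beta_2 b_{t-1} + (1 - \beta_2)\, s_t \odot s_t \\
\theta_t &= \theta_{t-1} - \gamma\, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\max\left(\sqrt{b_t},\ \delta \sqrt{1 - \beta_2^t}\right)}
\end{aligned}
\]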
Here \(\theta\) denotes the parameters, \(\gamma\) the learning rate, \(g_t\) the gradient, \(m_t\) the first moment, \(s_t\) the stepwise difference of bias-corrected first moments (with \(s_1 = m_1 / (1 - \beta_1)\)), \(b_t\) the second moment of \(s_t\), \(\beta_1, \beta_2\) the decay rates, and \(\delta\) the threshold controlling the SGD-to-adaptive switch.
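To make the role of each symbol concrete, here is a minimal NumPy sketch of a single AGD step, following the update as given in the referenced paper. It is illustrative only and is not this library's implementation: the names `init_state` and `agd_step`, the state layout, and the default hyperparameter values are all hypothetical, and weight decay and other options are left out.

```python
import numpy as np


def init_state(theta):
    # Hypothetical state layout: step count, first moment m, second moment b
    # of s, and the previous bias-corrected first moment (all start at zero).
    zeros = np.zeros_like(theta)
    return {"t": 0, "m": zeros, "b": zeros.copy(), "m_hat": zeros.copy()}


def agd_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-5):
    """Apply one AGD update; returns (new_theta, new_state)."""
    t = state["t"] + 1

    # First moment and its bias-corrected version.
    m = beta1 * state["m"] + (1.0 - beta1) * grad
    m_hat = m / (1.0 - beta1 ** t)

    # Stepwise difference of bias-corrected moments. The stored previous
    # value is zero at t = 1, which gives s_1 = m_1 / (1 - beta1).
    s = m_hat - state["m_hat"]

    # Second moment of s.
    b = beta2 * state["b"] + (1.0 - beta2) * s * s

    # The max gates the regime per coordinate: sqrt(b) wins where the
    # difference is large (adaptive); the delta term wins otherwise (SGD-like).
    bias2 = np.sqrt(1.0 - beta2 ** t)
    denom = np.maximum(np.sqrt(b), delta * bias2)

    new_theta = theta - lr * bias2 * m_hat / denom
    new_state = {"t": t, "m": m, "b": b, "m_hat": m_hat}
    return new_theta, new_state


theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.0])

# Small delta: the first coordinate takes an adaptive, sign-like step of size lr.
adaptive, _ = agd_step(theta, grad, init_state(theta), delta=1e-5)

# Large delta: the same coordinate falls back to scaled momentum, lr * m_hat / delta.
sgd_like, _ = agd_step(theta, grad, init_state(theta), delta=1.0)
```

With these inputs the first coordinate moves by about `1e-3` in the adaptive case and about `5e-4` in the SGD-like case, while the zero-gradient coordinate stays put in both. Raising `delta` pushes more coordinates into the SGD-like regime.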
Reference: Yun Yue, Zhiling Ye, Jiadi Jiang, Yongchao Liu, Ke Zhang, "AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix", NeurIPS 2023. https://arxiv.org/abs/2312.01658