Ano
Implements Ano, a momentum optimizer that decouples step direction from step magnitude.
The direction of each step is taken from the sign of the momentum \(m_t\), which aggregates gradient history for stability, while the magnitude is taken from the instantaneous gradient \(|g_t|\), so the optimizer reacts quickly in noisy landscapes instead of being damped by stale momentum. The denominator uses a Yogi-style second moment \(v_t\) that decreases additively, and a decoupled weight decay term is applied directly to the parameters. The variant Anolog replaces the fixed \(\beta_1\) with the schedule \(\beta_{1,t} = 1 - 1/\log(t+2)\).
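Written out, the update this description implies could look like the following. This is a sketch reconstructed from the prose, not copied from the reference. The second-moment line assumes the standard Yogi recurrence, and the zero initialization \(m_0 = v_0 = 0\) is likewise an assumption, so the paper's exact form may differ.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= v_{t-1} - (1 - \beta_2)\, \operatorname{sign}\!\left(v_{t-1} - g_t^2\right) g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{\eta_t}{\sqrt{v_t} + \epsilon}\, |g_t|\, \operatorname{sign}(m_t) - \eta_t \lambda\, \theta_{t-1}
\end{aligned}
\]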
In the update rule, \(\theta\) denotes the parameters, \(\eta\) is the base learning rate and \(\eta_t\) its scheduled (decaying) value at step \(t\), \(g_t\) is the gradient, \(m_t\) and \(v_t\) are the first and second moment estimates, \(\beta_1\) and \(\beta_2\) are their decay rates, \(\lambda\) is the weight decay coefficient, and \(\epsilon\) is a small constant for numerical stability.
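A minimal pure-Python sketch of one step under these definitions may make the decoupling concrete. The names `ano_step` and `anolog_beta1` are hypothetical, the second moment uses the standard Yogi recurrence as an assumption, and the hyperparameter values in the usage lines are placeholders rather than recommended defaults. The schedule helper assumes steps are counted from \(t = 1\), since the formula gives a negative value at \(t = 0\).

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def anolog_beta1(t):
    """Anolog schedule beta_{1,t} = 1 - 1/log(t + 2).

    Assumption: t is counted from 1; at t = 0 the value is negative.
    """
    return 1.0 - 1.0 / math.log(t + 2)


def ano_step(theta, grad, m, v, lr, beta1, beta2, weight_decay=0.0, eps=1e-8):
    """One illustrative Ano step over flat lists of floats.

    lr plays the role of eta_t, i.e. the already-scheduled learning rate.
    Returns the new (theta, m, v) without mutating the inputs.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        g2 = g * g
        # First moment: exponential moving average of the gradient.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Second moment: standard Yogi recurrence (assumed form).
        v_i = v_i - (1.0 - beta2) * sign(v_i - g2) * g2
        # Direction from sign(m_t), magnitude from |g_t|.
        step = lr * abs(g) * sign(m_i) / (math.sqrt(v_i) + eps)
        # Decoupled weight decay acts on the previous parameter value.
        p = p - step - lr * weight_decay * p
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Usage on a toy objective sum(p**2), with placeholder hyperparameters.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    grad = [2.0 * p for p in theta]
    theta, m, v = ano_step(theta, grad, m, v, lr=0.1,
                           beta1=anolog_beta1(t), beta2=0.99)
```

Because the direction comes only from the sign of the momentum, a gradient that disagrees with it changes the size of the step immediately but flips its direction only once the momentum itself changes sign.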
Reference: Adrien Kegreisz, "Ano: Faster is Better in Noisy Landscapes", arXiv 2025. https://arxiv.org/abs/2508.18258