Ranger21
Implements Ranger21, a synergistic combination of AdamW and eight techniques.
Ranger21 keeps an AdamW core and layers on adaptive gradient clipping,
gradient centralization, gradient normalization, positive-negative
momentum, norm loss, stable weight decay, a linear warmup combined with an
explore-exploit warmdown schedule, Lookahead, and a softplus-smoothed
denominator. The positive-negative momentum keeps two first-moment buffers,
one for odd and one for even steps, and forms the update direction as a
positively weighted current moment plus a negatively weighted previous
moment, normalized so the learning rate need not change with beta0.
The learning rate \(\eta_t\) follows the explore-exploit schedule, a
linear warmup over the first \(t_{\text{warmup}}\) steps, a flat phase,
and a linear warmdown over the last \(t_{\text{warmdown}}\) steps, which
is why num_iterations (the total number of training steps) is required.
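The three-phase shape described above can be sketched in a few lines of plain Python. This is only an illustration, not the library's code: the function name `explore_exploit_lr` and the arguments `base_lr`, `warmup_steps`, `warmdown_steps`, and `min_lr` are hypothetical, and the way the optimizer actually derives the warmup and warmdown lengths from its arguments is not shown here.

```python
def explore_exploit_lr(step, base_lr, num_iterations,
                       warmup_steps, warmdown_steps, min_lr=0.0):
    """Illustrative schedule: linear warmup, flat phase, linear warmdown.

    `step` is 1-indexed. All names except `num_iterations` are hypothetical.
    """
    if not 1 <= step <= num_iterations:
        raise ValueError("step must lie in [1, num_iterations]")
    if warmup_steps + warmdown_steps > num_iterations:
        raise ValueError("warmup and warmdown must fit in num_iterations")

    # Phase 1: ramp linearly from base_lr / warmup_steps up to base_lr.
    if step <= warmup_steps:
        return base_lr * step / warmup_steps

    # Phase 3: decay linearly from base_lr down to min_lr at the last step.
    warmdown_start = num_iterations - warmdown_steps
    if step > warmdown_start:
        progress = (step - warmdown_start) / warmdown_steps
        return base_lr + (min_lr - base_lr) * progress

    # Phase 2: flat phase at the full learning rate.
    return base_lr


# Sample the schedule at a few points of a 100-step run.
schedule = [explore_exploit_lr(s, 1.0, 100, 10, 20) for s in (1, 10, 50, 90, 100)]
```

The warmdown boundary is computed from the end of training, so the schedule cannot be evaluated without knowing the total step count, which is the dependency on num_iterations noted above.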
Note: Following the reference implementation, the positive-negative momentum combination fixes the coefficients to \(\beta_0 = 1\) (so the update is \(2 m_t - m_{t-1}\)) and normalizes by \(\sqrt{(1 + \beta_2)^2 + \beta_2^2}\); the beta0 argument is retained only for the noise-amplitude validation range.
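As a rough sketch of the combination described in the note, the snippet below computes \((2 m_t - m_{t-1}) / \sqrt{(1 + \beta_2)^2 + \beta_2^2}\) elementwise on plain Python lists. The helper names `select_buffers` and `pnm_direction` and the `"odd"`/`"even"` dictionary layout are assumptions made for illustration; the moment updates themselves are omitted.

```python
import math


def select_buffers(state, step):
    """Return (current, previous) first-moment buffers by step parity.

    `state` is a hypothetical dict with one buffer for odd steps and one
    for even steps.
    """
    if step % 2 == 1:
        return state["odd"], state["even"]
    return state["even"], state["odd"]


def pnm_direction(m_curr, m_prev, beta2):
    """Positive-negative momentum with the fixed coefficients from the note."""
    # Normalizer stated in the note: sqrt((1 + beta2)^2 + beta2^2).
    noise_norm = math.sqrt((1.0 + beta2) ** 2 + beta2 ** 2)
    # Weight +2 on the current moment, -1 on the previous one.
    return [(2.0 * c - p) / noise_norm for c, p in zip(m_curr, m_prev)]


# Example: on step 3 (odd) the "odd" buffer is the current moment.
state = {"odd": [0.3, -0.1], "even": [0.2, 0.4]}
m_curr, m_prev = select_buffers(state, step=3)
direction = pnm_direction(m_curr, m_prev, beta2=0.999)
```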
Reference: Less Wright, Nestor Demeure, "Ranger21: a synergistic deep learning optimizer", arXiv 2021. https://arxiv.org/abs/2106.13731