Ranger¶
Implements Ranger, RAdam with a Lookahead wrapper and gradient centralization.
Each step takes a rectified Adam (RAdam) update on the fast weights and, every \(k\) steps, interpolates a set of slow weights toward them (Lookahead). Gradients are optionally centralized by subtracting their mean before the moment updates.
When the length of the approximated simple moving average satisfies
\(\rho_t \geq\) n_sma_threshold (default 5), the variance is
tractable and the step is rectified:
Otherwise, with degenerated_to_sgd=True, the update falls back to the
unscaled first moment,
\(\theta_t = \theta_{t-1} - \frac{\eta}{1 - \beta_1^t} m_t\); with the
default degenerated_to_sgd=False the rectified branch is simply skipped
until the moving average becomes tractable. Every
\(k\) steps the slow weights \(\phi\) track the fast weights,
\(\phi_t = \phi_{t-1} + \alpha (\theta_t - \phi_{t-1})\), and the fast
weights are reset to \(\phi_t\).
Here \(\theta\) are the parameters, \(\eta\) is the learning rate, \(g_t\) is the gradient, \(m_t\) and \(v_t\) are the first and second moments, \(\beta_1, \beta_2\) are their decay rates, \(k\) is the Lookahead synchronization period, and \(\alpha\) is the Lookahead interpolation factor.
Reference: Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han, "On the Variance of the Adaptive Learning Rate and Beyond", ICLR 2020. https://arxiv.org/abs/1908.03265 Reference: Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba, "Lookahead Optimizer: k steps forward, 1 step back", NeurIPS 2019. https://arxiv.org/abs/1907.08610 Reference: Hongwei Yong, Jianqiang Huang, Xiansheng Hua, Lei Zhang, "Gradient Centralization: A New Optimization Technique for Deep Neural Networks", ECCV 2020. https://arxiv.org/abs/2004.01461