AdaSmooth
Implements AdaSmooth, an adaptive learning rate method based on the effective ratio.
AdaSmooth replaces the fixed decay of the squared-gradient running average with a per-parameter smoothing constant derived from the effective ratio of the recent parameter trajectory. The effective ratio measures how directed the movement has been: it is the magnitude of the accumulated change divided by the accumulated absolute change. A directed trajectory (ratio near one) yields a short averaging window, which speeds up the descent, while a zigzagging trajectory (ratio near zero) yields a long window, which slows the descent near a minimum.
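One way to write the update described above is sketched here. It is reconstructed from the symbol definitions on this page and the referenced paper; the exact time indexing, the accumulation window for \(s_t\) and \(n_t\), and the placement of \(\epsilon\) inside the square root are assumptions and may differ from the implementation.

\[
\begin{aligned}
s_t &= s_{t-1} + (\theta_t - \theta_{t-1}), \\
n_t &= n_{t-1} + |\theta_t - \theta_{t-1}|, \\
e_t &= \frac{|s_t|}{n_t}, \\
c_t &= (\rho_2 - \rho_1)\, e_t + (1 - \rho_2), \\
v_t &= c_t^2\, g_t^2 + (1 - c_t^2)\, v_{t-1}, \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t
\end{aligned}
\]

In this form, \(e_t = 0\) gives \(c_t = 1 - \rho_2\) (the long window) and \(e_t = 1\) gives \(c_t = 1 - \rho_1\) (the short window).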
In the update rule, \(\theta\) are the parameters, \(g_t\) is the gradient,
\(s_t\) and \(n_t\) accumulate the signed and absolute parameter
changes, \(e_t\) is the effective ratio, \(c_t\) is the scaled
smoothing constant built from the fast and slow decay rates
\(\rho_1, \rho_2\) (passed as betas), \(v_t\) is the running
average of the squared gradient, \(\eta\) is the learning rate, and
\(\epsilon\) guards the denominator.
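The roles of these quantities can be illustrated with a minimal pure-Python sketch of a single step. It is not the library's implementation: the names `init_state` and `adasmooth_step`, the default values, the unbounded accumulation of \(s_t\) and \(n_t\), and the choice of \(e_t = 0\) before any movement has occurred are all assumptions made for illustration.

```python
import math


def init_state(n_params):
    """Zeroed accumulators (hypothetical helper): s (signed change),
    n (absolute change), v (running average of the squared gradient)."""
    return {key: [0.0] * n_params for key in ("s", "n", "v")}


def adasmooth_step(theta, grad, state, lr=1e-3, betas=(0.5, 0.99), eps=1e-6):
    """Apply one AdaSmooth-style update to a list of floats (illustrative only)."""
    rho1, rho2 = betas  # fast and slow decay rates
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        s, n = state["s"][i], state["n"][i]
        # Effective ratio; defined as 0 before any movement (assumption).
        e = abs(s) / n if n > 0.0 else 0.0
        # Scaled smoothing constant: 1 - rho2 at e = 0, 1 - rho1 at e = 1.
        c = (rho2 - rho1) * e + (1.0 - rho2)
        # Running average of the squared gradient, weighted by c^2.
        v = c * c * g * g + (1.0 - c * c) * state["v"][i]
        delta = -lr * g / math.sqrt(v + eps)
        # Accumulate the signed and absolute parameter changes for the next step.
        state["s"][i] = s + delta
        state["n"][i] = n + abs(delta)
        state["v"][i] = v
        new_theta.append(p + delta)
    return new_theta
```

Calling `adasmooth_step` repeatedly with gradients of the same sign keeps `abs(s) / n` at one, while alternating signs pull it below one, which is the behaviour the effective ratio is meant to capture.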
Reference: Jun Lu, "AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio", arXiv 2022. https://arxiv.org/abs/2204.00825