AdaNorm
Implements AdaNorm, an Adam variant with adaptive gradient norm correction.
AdaNorm tracks an exponential moving average of the gradient norm. When the norm of the current gradient falls below that average, the gradient is rescaled up to the running norm before it enters the first moment. This keeps the first moment from being pulled down by unusually small gradients, so it stays driven by a high and representative gradient magnitude throughout training. The second moment continues to use the raw gradient.
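Written out, the update takes roughly the following form. This is a reconstruction from the description above and from standard Adam, not a verbatim copy of the reference: the bias-corrected terms \(\hat{m}_t, \hat{v}_t\), the stability constant \(\epsilon\), and the corrected gradient \(\hat{g}_t\) are assumptions carried over from Adam-style notation.

\[
\begin{aligned}
s_t &= r\, s_{t-1} + (1 - r)\, \lVert g_t \rVert_2 \\
\hat{g}_t &= \begin{cases} \dfrac{s_t}{\lVert g_t \rVert_2}\, g_t & \text{if } s_t > \lVert g_t \rVert_2 \\ g_t & \text{otherwise} \end{cases} \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, \hat{g}_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]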
where \(\theta\) are the parameters, \(\eta\) is the learning rate, \(g_t\) is the gradient, \(s_t\) is the running gradient norm with decay \(r\), \(m_t\) and \(v_t\) are the first and second moments, and \(\beta_1, \beta_2\) are their decay rates.
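To make the roles of these symbols concrete, here is a minimal pure-Python sketch of one update step on a flat list of floats. It is an illustration under stated assumptions, not this library's implementation: the names `init_state` and `adanorm_step` are hypothetical, the default hyperparameters are illustrative, and the bias correction and `eps` term are assumed to follow standard Adam.

```python
import math


def init_state(n):
    """Hypothetical helper: zero-initialised optimizer state for n parameters."""
    return {"t": 0, "s": 0.0, "m": [0.0] * n, "v": [0.0] * n}


def adanorm_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                 r=0.95, eps=1e-8):
    """One AdaNorm-style step (sketch). Returns the updated parameter list.

    Assumptions: Adam-style bias correction and eps; default values are
    illustrative only.
    """
    state["t"] += 1
    t = state["t"]

    # L2 norm of the current gradient.
    g_norm = math.sqrt(sum(g * g for g in grad))

    # s_t: exponential moving average of the gradient norm, decay r.
    state["s"] = r * state["s"] + (1 - r) * g_norm

    # Rescale the gradient up to the running norm only when its own norm
    # is smaller; the guard also avoids dividing by a zero norm.
    scale = state["s"] / g_norm if 0.0 < g_norm < state["s"] else 1.0

    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # First moment uses the norm-corrected gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * scale * g
        # Second moment uses the raw gradient.
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Adam-style bias correction (assumed).
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_theta
```

Because the running norm starts at zero in this sketch, the first few steps typically leave the gradient unscaled; the correction only takes effect once \(s_t\) has grown past the norm of an incoming gradient.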
Reference: Shiv Ram Dubey, Satish Kumar Singh, Bidyut Baran Chaudhuri, "AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs", WACV 2023. https://arxiv.org/abs/2210.06364