AdaL¶
Implements AdaL, an Adam variant that transforms the gradient by its \(\ell_1\)-norm before accumulating moments.
AdaL keeps Adam's structure but applies an adaptive gradient transformation: each gradient is scaled by its own \(\ell_1\)-norm prior to the moment updates. This amplifies large gradients early in training to speed convergence and damps them near a minimum to improve generalization. The step size is annealed as \(\eta_t = \eta/\sqrt{t}\), and a small \(\epsilon\) is added inside the preconditioner rather than to the denominator.
where \(\theta\) are the parameters, \(\eta\) the base learning rate, \(g_t\) the gradient, \(\lVert g_t \rVert_1\) its \(\ell_1\)-norm, \(\hat{g}_t\) the transformed gradient, \(m_t, v_t\) the first and second moment estimates with decay rates \(\beta_1, \beta_2\), and \(\epsilon\) a stability constant.
Reference: Hongwei Zhang, Weidong Zou, Hongbo Zhao, Qi Ming, Tijin Yan, Yuanqing Xia, Weipeng Cao, "AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations", arXiv 2021. https://arxiv.org/abs/2107.01525