Adai
Implements Adai (Adaptive Inertia), an optimizer that moves Adam's parameter-wise adaptivity from the learning rate to the momentum.
Unlike Adam, the adaptive second moment is not used to scale the step size directly. Instead, it modulates a parameter-wise inertia (momentum) factor \(\beta_{1,t}\): parameters whose bias-corrected second moment \(\hat{v}_t\) is large relative to the mean \(\bar{v}_t\) over all parameters receive a smaller momentum, while parameters with a small second moment receive a larger one. Bias correction of the first moment uses a per-parameter cumulative product of the inertia factors.
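The description above can be sketched as a single update step in NumPy. This is a minimal illustration and not the library's implementation: the names `init_state` and `adai_step` are hypothetical, the hyperparameter values are placeholders, the clipping of the inertia to \([0, 1 - \epsilon]\) follows the referenced paper, and the mean \(\bar{v}_t\) is taken over one flat array rather than over every parameter group. It also assumes at least one gradient entry is nonzero, so that \(\bar{v}_t > 0\).

```python
import numpy as np


def init_state(param):
    """Hypothetical helper: per-parameter optimizer state for the sketch below."""
    return {
        "step": 0,
        "v": np.zeros_like(param),          # second-moment EMA
        "m": np.zeros_like(param),          # first moment (inertia-weighted)
        "beta1_prod": np.ones_like(param),  # cumulative product of inertia factors
    }


def adai_step(param, grad, state, lr=1.0, beta0=0.1, beta2=0.99, eps=1e-3):
    """One illustrative Adai step (dampening d = 1) on a flat parameter array."""
    state["step"] += 1

    # Second-moment EMA with Adam-style bias correction.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad**2
    v_hat = state["v"] / (1.0 - beta2 ** state["step"])
    v_bar = v_hat.mean()  # assumption: mean over this array only

    # Parameter-wise inertia: large v_hat relative to the mean gives
    # a smaller momentum factor. Clipped to [0, 1 - eps].
    beta1 = np.clip(1.0 - beta0 * v_hat / v_bar, 0.0, 1.0 - eps)

    # First moment, bias-corrected by the running product of inertia factors.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["beta1_prod"] = state["beta1_prod"] * beta1
    m_hat = state["m"] / (1.0 - state["beta1_prod"])

    # The step size is lr alone; the second moment never divides the update.
    return param - lr * m_hat
```

On the very first step the bias correction cancels the inertia weighting exactly, so the update reduces to a plain gradient step `param - lr * grad`; the parameter-wise inertia only shapes later steps.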
The dampening argument generalizes this update. With dampening \(d\), the inertia exponent becomes \(1 / (3 - 2 d)\), the gradient is scaled by \((1 - \beta_{1,t})^d\), and the update is rescaled by \(\beta_0^{1 - d}\). The default \(d = 1\) recovers the published Adai update: the exponent is then 1, the gradient scale is \(1 - \beta_{1,t}\), and the rescaling factor is 1.
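To make the three dampening-dependent quantities concrete, the following sketch computes them for a given bias-corrected second moment. It is an illustration under stated assumptions, not the library's code: `dampened_factors` is a hypothetical name, and the sketch assumes that the exponent \(1 / (3 - 2 d)\) is applied to the ratio \(\hat{v}_t / \bar{v}_t\) before it is multiplied by \(\beta_0\), which the text above does not spell out. The exponent is undefined at \(d = 1.5\).

```python
import numpy as np


def dampened_factors(v_hat, beta0=0.1, dampening=1.0, eps=1e-3):
    """Illustrative inertia, gradient scale, and update scale for dampening d."""
    # Assumption: the exponent acts on the ratio v_hat / mean(v_hat).
    exponent = 1.0 / (3.0 - 2.0 * dampening)
    ratio = v_hat / v_hat.mean()

    beta1 = np.clip(1.0 - beta0 * ratio**exponent, 0.0, 1.0 - eps)
    grad_scale = (1.0 - beta1) ** dampening   # multiplies the gradient
    update_scale = beta0 ** (1.0 - dampening)  # multiplies the final update
    return beta1, grad_scale, update_scale


v_hat = np.array([1.0, 4.0, 9.0])
beta1, grad_scale, update_scale = dampened_factors(v_hat, dampening=1.0)
# With d = 1: exponent is 1, grad_scale equals 1 - beta1, update_scale is 1.
```

With `dampening=0.0` the same function gives an exponent of \(1/3\), a gradient scale of 1 for every parameter, and an update scale of \(\beta_0\).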
Reference: Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama, "Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum", ICML 2022. https://arxiv.org/abs/2006.15815