FAdam
Implements FAdam (Fisher Adam), recasting Adam as natural gradient descent with a diagonal empirical Fisher information matrix.
The "F" stands for Fisher, not fractional: the second-moment buffer is read as a diagonal empirical Fisher, and the update is a natural gradient step.
The buffer \(f_t\) accumulates the squared gradient as a diagonal empirical Fisher. Dividing the gradient by \(f_t^{p}\) gives the natural gradient \(\bar{g}_t\); \(p = 1/2\) recovers the Adam denominator. Both the natural gradient and the weight-decay term are rescaled so that their root-mean-square does not exceed \(c\), and momentum is applied to the clipped natural gradient. The decoupled weight decay \(\lambda\) is itself preconditioned by the Fisher.
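The description above can be sketched as a single update step in plain Python. This is a minimal illustration, not the official implementation: the names `fadam_step`, `rms`, and `rms_clip` are hypothetical, parameters are flat lists of floats, and the final combination `theta - lr * (m + weight_decay * wd)` is an assumption about how the momentum and the preconditioned weight decay are merged.

```python
import math


def rms(xs):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def rms_clip(xs, c):
    """Rescale xs so its RMS is at most c; leave it unchanged if already below."""
    scale = max(1.0, rms(xs) / c)
    return [x / scale for x in xs]


def fadam_step(theta, g, m, f, *, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, p=0.5, c=1.0, weight_decay=0.0):
    """One illustrative step; returns the new (theta, m, f)."""
    # Diagonal empirical Fisher: EMA of the squared gradient.
    f = [beta2 * fi + (1 - beta2) * gi * gi for fi, gi in zip(f, g)]
    denom = [fi ** p + eps for fi in f]
    # Natural gradient, RMS-clipped to c.
    g_nat = rms_clip([gi / d for gi, d in zip(g, denom)], c)
    # Momentum is applied to the clipped natural gradient.
    m = [beta1 * mi + (1 - beta1) * gn for mi, gn in zip(m, g_nat)]
    # Weight decay is preconditioned by the same Fisher term, then clipped.
    wd = rms_clip([ti / d for ti, d in zip(theta, denom)], c)
    # Assumed combination of momentum and decoupled weight decay.
    theta = [ti - lr * (mi + weight_decay * wi)
             for ti, mi, wi in zip(theta, m, wd)]
    return theta, m, f
```

Here `beta2` and `eps` are passed in as plain numbers, so a caller is free to supply per-step values for either.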
Note: following the official implementation, the Fisher EMA uses a debiased decay \(\hat{\beta}_2 = \beta_2 (1 - \beta_2^{t-1}) / (1 - \beta_2^{t})\) in place of \(\beta_2\), and the stability constant is scaled by the gradient RMS, \(\epsilon_t = \min(\mathrm{RMS}(g_t), 1)\,\epsilon\), so the denominator is \(f_t^p + \epsilon_t\).
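To make the two adjustments concrete, the sketch below computes \(\hat{\beta}_2\) and \(\epsilon_t\) and checks numerically that an EMA run with the debiased decay matches the usual bias-corrected EMA \(v_t / (1 - \beta_2^t)\). The function names (`debiased_beta2`, `scaled_eps`, `compare_emas`) are hypothetical and the inputs are arbitrary illustrative values.

```python
import math


def debiased_beta2(beta2, t):
    """Per-step decay beta2 * (1 - beta2**(t-1)) / (1 - beta2**t), for t >= 1."""
    return beta2 * (1 - beta2 ** (t - 1)) / (1 - beta2 ** t)


def scaled_eps(g, eps):
    """Stability constant scaled by the gradient RMS, capped at eps."""
    rms_g = math.sqrt(sum(x * x for x in g) / len(g))
    return min(rms_g, 1.0) * eps


def compare_emas(xs, beta2, init=123.0):
    """Return (debiased-decay EMA, bias-corrected EMA) after each step."""
    f = init  # arbitrary: the decay is 0 at t = 1, so the start value is overwritten
    v = 0.0   # standard EMA, zero-initialised
    debiased, corrected = [], []
    for t, x in enumerate(xs, start=1):
        b = debiased_beta2(beta2, t)
        f = b * f + (1 - b) * x
        v = beta2 * v + (1 - beta2) * x
        debiased.append(f)
        corrected.append(v / (1 - beta2 ** t))
    return debiased, corrected


debiased, corrected = compare_emas([4.0, 1.0, 9.0, 0.25], beta2=0.9)
for a, b in zip(debiased, corrected):
    assert math.isclose(a, b, rel_tol=1e-9)
```

The agreement follows from \(1 - \hat{\beta}_2 = (1 - \beta_2) / (1 - \beta_2^{t})\), so the debiased decay folds the bias correction into the EMA itself. Note that `scaled_eps` returns 0 for an all-zero gradient, so the denominator then relies on \(f_t^p\) alone.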
Reference: Dongseong Hwang, "FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information", 2024. https://arxiv.org/abs/2405.12807