FAdam

Implements FAdam, an Adam variant whose momentum is driven by a Grünwald–Letnikov fractional-order difference of the gradient.

The method replaces the integer-order difference that underlies Adam's first moment with a fractional-order difference of order \(\alpha\). The Grünwald–Letnikov difference weights past gradients by its binomial coefficients, so each update blends the current gradient with a fading history of earlier ones. The authors use a Caputo-based strategy to guarantee convergence to the true optimum, and they adjust the order \(\alpha\) dynamically across iterations, which acts much like tuning the momentum coefficient. Applying the same construction to Adagrad yields FAdagrad.
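The weighted memory can be made concrete with a small sketch. It assumes the weights \(w_k = (-1)^k \binom{\alpha}{k}\) and computes them with the standard recurrence \(w_0 = 1\), \(w_k = w_{k-1}\,(1 - (\alpha+1)/k)\) rather than with gamma functions. The function names `gl_weights` and `gl_difference` are hypothetical, chosen for illustration, and are not taken from the paper or from any library.

```python
def gl_weights(alpha, n):
    """Return [w_0, ..., w_{n-1}] with w_k = (-1)^k * binom(alpha, k).

    Uses the recurrence w_k = w_{k-1} * (1 - (alpha + 1) / k), which avoids
    evaluating the gamma function at its poles.
    """
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights


def gl_difference(grads, alpha):
    """Fractional difference of the newest gradient.

    `grads` is ordered oldest to newest (an assumption of this sketch), so
    the newest gradient is paired with w_0, the one before it with w_1, etc.
    """
    weights = gl_weights(alpha, len(grads))
    return sum(w * g for w, g in zip(weights, reversed(grads)))


# For alpha = 0.5 the weights after w_0 are negative and shrink in magnitude.
print(gl_weights(0.5, 4))
# For alpha = 1 only w_0 and w_1 are nonzero: the ordinary first difference.
print(gl_difference([3.0, 5.0], 1.0))
```

For \(0 < \alpha < 1\) every past gradient contributes with a nonzero weight, which is the fading history described above. At \(\alpha = 1\) the sum collapses to \(g_t - g_{t-1}\).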

\[ \begin{aligned} \Delta^{\alpha} g_t &= \sum_{k=0}^{t} (-1)^k \binom{\alpha}{k}\, g_{t-k}, \qquad \binom{\alpha}{k} = \frac{\Gamma(\alpha+1)}{\Gamma(k+1)\,\Gamma(\alpha-k+1)} \\ m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, \Delta^{\alpha} g_t \\ v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}} \\ \theta_{t+1} &= \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} \]

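where \(\theta\) are the parameters, \(\eta\) the learning rate, \(g_t = \nabla_\theta f(\theta_t)\) the gradient, \(\Delta^{\alpha} g_t\) the Grünwald–Letnikov fractional-order difference of order \(\alpha \in (0,1]\), \(\Gamma\) the gamma function, \(m_t,v_t\) the first and second moments with decay rates \(\beta_1,\beta_2\), and \(\epsilon\) a small constant for numerical stability. As in Adam, the square, square root, and division are applied elementwise. For \(\alpha = 1\) the gamma function in the denominator has poles at \(k \ge 2\), so those binomial coefficients are taken to be zero and the difference reduces to \(g_t - g_{t-1}\).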
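The update equations can be sketched for a single scalar parameter as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: \(\alpha\) is held fixed, the gradient history is truncated to a finite `window`, and the Caputo-based strategy and the dynamic adjustment of \(\alpha\) from the paper are not reproduced. The names `FAdamScalar`, `gl_weights`, and `window` are hypothetical.

```python
import math
from collections import deque


def gl_weights(alpha, n):
    """Return [w_0, ..., w_{n-1}] with w_k = (-1)^k * binom(alpha, k)."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights


class FAdamScalar:
    """Illustrative scalar optimizer that follows the displayed equations."""

    def __init__(self, lr=1e-3, alpha=0.9, beta1=0.9, beta2=0.999,
                 eps=1e-8, window=10):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        # Assumption: keep only the last `window` gradients instead of all t.
        self.weights = gl_weights(alpha, window)
        self.history = deque(maxlen=window)  # newest gradient at the left
        self.m = 0.0
        self.v = 0.0
        self.t = 0

    def step(self, theta, grad):
        """Apply one update and return the new parameter value."""
        self.t += 1
        self.history.appendleft(grad)
        # Fractional difference: sum_k w_k * g_{t-k} over the stored history.
        frac = sum(w * g for w, g in zip(self.weights, self.history))
        # First moment tracks the fractional difference, second moment g^2.
        self.m = self.beta1 * self.m + (1 - self.beta1) * frac
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        # Bias correction, as in Adam.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
```

A caller would construct `FAdamScalar(lr=0.1, alpha=0.5)` once and then call `theta = opt.step(theta, grad)` each iteration with the gradient at the current `theta`. With `window=1` only \(w_0 = 1\) is used, so the sketch reduces to a plain Adam step.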

Reference: Haiming Zhao, Honggang Yang, Jiejie Chen, Ping Jiang, Zhigang Zeng, "Parameter training methods for convolutional neural networks with adaptive adjustment method based on Caputo fractional-order differences", Chaos, Solitons & Fractals 2025. https://doi.org/10.1016/j.chaos.2025.116588
