Fractional Derivative Gradient Optimizers (FSGD)
Implements the Fractional Derivative Gradient Optimizers (FSGD, FAdam, ...): gradient-based optimizers in which the ordinary gradient is replaced by a Caputo fractional-order derivative.
The method replaces the first-order weight derivative in any base optimizer with the Caputo fractional derivative of order \(\nu\). Using the power rule \(D^\nu x^p = \frac{\Gamma(p+1)}{\Gamma(p-\nu+1)} x^{p-\nu}\) at \(p=1\), the chain rule applied to the loss leaves the ordinary backpropagated gradient \(g_t\) intact but multiplies it by a per-parameter fractional factor \(f^\nu\). So in practice each existing optimizer (SGD, Adam, Adagrad, ...) is turned into its fractional version by scaling the current gradient by \(f^\nu\) before the usual update.
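As a worked step, the factor can be read off from the power rule. This is a sketch in this page's notation: the symbol \(\partial^\nu L / \partial \theta^\nu\) for the fractional derivative of the loss \(L\) is an assumption of this note, and the factorisation is the chain-rule form described above, not an exact identity of fractional calculus. Setting \(p = 1\) and using \(\Gamma(2) = 1\):

\[
D^\nu \theta = \frac{\Gamma(2)}{\Gamma(2-\nu)}\,\theta^{1-\nu} = \frac{\theta^{1-\nu}}{\Gamma(2-\nu)},
\qquad
\frac{\partial^\nu L}{\partial \theta^\nu} \approx \frac{\partial L}{\partial \theta}\, D^\nu \theta = g_t\,\frac{\theta^{1-\nu}}{\Gamma(2-\nu)}.
\]

Before any safeguard is applied, the factor multiplying \(g_t\) is therefore \(\theta^{1-\nu} / \Gamma(2-\nu)\).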
To keep the factor real and finite, the raw parameter is replaced by \(|\theta| + \epsilon\): this avoids complex values for negative \(\theta\) when \(1-\nu\) is a fraction with an even denominator, avoids division by zero at \(\theta = 0\) when \(\nu > 1\), and gives a well-defined limit \(f^\nu \to 1\) as \(\nu \to 1\), so the rule reduces exactly to ordinary (integer-order) gradient descent.
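The three safeguards can be checked numerically. The snippet below is a minimal sketch, not this library's implementation: it assumes the factor takes the form \((|\theta| + \epsilon)^{1-\nu} / \Gamma(2-\nu)\) implied by the power rule at \(p = 1\), and `fractional_factor` is a hypothetical helper name that uses only the standard library.

```python
import math

def fractional_factor(theta: float, nu: float, eps: float = 1e-6) -> float:
    """Safeguarded factor f^nu = (|theta| + eps)^(1 - nu) / Gamma(2 - nu).

    Hypothetical helper. Assumes nu is not 2, 3, ...: Gamma(2 - nu) has
    poles there and math.gamma raises ValueError.
    """
    return (abs(theta) + eps) ** (1.0 - nu) / math.gamma(2.0 - nu)

# nu = 1: the exponent is 0 and Gamma(1) = 1, so the factor is 1
# and the gradient is left unchanged.
f_integer = fractional_factor(-0.5, nu=1.0)

# Negative parameter, nu = 0.5: the raw power (-0.5) ** 0.5 is complex
# in Python, while the safeguarded factor stays a real float.
raw_negative = (-0.5) ** 0.5
f_negative = fractional_factor(-0.5, nu=0.5)

# Zero parameter, nu = 1.5: the raw power 0.0 ** -0.5 divides by zero,
# while the safeguarded factor is large but finite.
try:
    raw_zero = 0.0 ** (1.0 - 1.5)
except ZeroDivisionError:
    raw_zero = None
f_zero = fractional_factor(0.0, nu=1.5)

print(f_integer, raw_negative, f_negative, raw_zero, f_zero)
```

The default `eps=1e-6` is an illustrative choice. A smaller value lets the factor grow larger near \(\theta = 0\) when \(\nu > 1\).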
For plain gradient descent, the resulting FSGD update is

\[
\theta_{t+1} = \theta_t - \eta\, g_t\, f^\nu, \qquad f^\nu = \frac{(|\theta_t| + \epsilon)^{1-\nu}}{\Gamma(2-\nu)},
\]

where \(\theta\) are the parameters, \(\eta\) the learning rate, \(g_t\) the backpropagated gradient, \(\nu > 0\) the fractional derivative order, \(\Gamma\) the Euler gamma function, and \(\epsilon > 0\) a small constant ensuring real, finite values. For Adam and the other adaptive optimizers the same factor \(f^\nu\) scales the gradient \(g_t\) that feeds their moment estimates, yielding FAdam, FAdagrad, FAdadelta, FRMSProp, FSGDP, and FAdamP.
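To make the scaling step concrete, here is a minimal pure-Python sketch of an FSGD step and an FAdam step. It is an illustration under stated assumptions, not this library's code: the factor is taken as \((|\theta| + \epsilon)^{1-\nu} / \Gamma(2-\nu)\), the Adam part is the textbook bias-corrected update, and the names `fractional_factor`, `fsgd_step`, `fadam_step`, and the `state` dictionary layout are all hypothetical.

```python
import math

def fractional_factor(theta, nu, eps=1e-6):
    # f^nu = (|theta| + eps)^(1 - nu) / Gamma(2 - nu); assumes nu is not 2, 3, ...
    return (abs(theta) + eps) ** (1.0 - nu) / math.gamma(2.0 - nu)

def fsgd_step(params, grads, lr=0.1, nu=1.2, eps=1e-6):
    """One FSGD update per parameter: theta <- theta - lr * g * f^nu."""
    return [p - lr * g * fractional_factor(p, nu, eps)
            for p, g in zip(params, grads)]

def fadam_step(params, grads, state, lr=1e-3, nu=1.2, eps=1e-6,
               betas=(0.9, 0.999), adam_eps=1e-8):
    """One FAdam update: standard Adam fed with the scaled gradient g * f^nu."""
    state["t"] += 1
    t = state["t"]
    b1, b2 = betas
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        g_frac = g * fractional_factor(p, nu, eps)  # the only fractional change
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g_frac
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g_frac ** 2
        m_hat = state["m"][i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - b2 ** t)  # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + adam_eps))
    return new_params

# Toy problem: minimise L(theta) = 0.5 * sum(theta_i ** 2), whose gradient is theta.
theta = [1.0, -2.0]
for _ in range(200):
    theta = fsgd_step(theta, grads=list(theta), lr=0.1, nu=1.2)

# With nu = 1 the factor is 1, so the step is ordinary SGD.
sgd_like = fsgd_step([1.0], [0.5], lr=0.1, nu=1.0)

# A single FAdam step from theta = 1 with gradient 1.
adam_state = {"t": 0, "m": [0.0], "v": [0.0]}
adam_theta = fadam_step([1.0], [1.0], adam_state, lr=1e-3, nu=1.2)
```

In this sketch the optimizer logic is untouched apart from the single line that forms `g_frac`, which mirrors how the same factor turns each base optimizer into its fractional variant.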
Reference: Oscar Herrera-Alcántara, "Fractional Derivative Gradient-Based Optimizers for Neural Networks and Human Activity Recognition", Applied Sciences 12(18):9264, 2022. https://doi.org/10.3390/app12189264