Fractional Gradient Descent (FSGD)

Implements Fractional Gradient Descent (FSGD), a Caputo-derivative reformulation of standard optimizers that replaces the first-order gradient with a fractional-order one.

The idea is to take the Caputo fractional derivative of the loss with respect to each parameter instead of the ordinary derivative. For the per-weight backpropagation term, this yields the integer-order gradient scaled by a closed-form fractional factor that depends only on the parameter's own magnitude, the fractional order \(\nu\), and the Gamma function. The factor introduces a single extra hyperparameter, \(\nu\), and the update reduces to the classical one as \(\nu \to 1\). Any existing optimizer (SGD, Adam, etc.) can therefore be "fractionalized" by multiplying its gradient element-wise by this factor. The authors implement this in PyTorch as drop-in "F"-prefixed optimizers and evaluate it on GAN and BERT training.

\[ \begin{aligned} f_\nu(\theta_t) &= \frac{\left(|\theta_t| + \epsilon\right)^{1-\nu}}{\Gamma(2-\nu)} \\ \theta_{t+1} &= \theta_t - \eta \, g_t \, f_\nu(\theta_t) \end{aligned} \]

where \(\theta_t\) are the parameters, \(g_t = \nabla_\theta E\) the ordinary gradient of the loss \(E\), \(\eta\) the learning rate, \(\nu\) the fractional order (with \(\nu = 1\) recovering classical gradient descent), \(\epsilon > 0\) a small constant that keeps the factor well defined at \(|\theta_t| = 0\), and \(\Gamma\) the Gamma function. The product \(g_t \, f_\nu(\theta_t)\) is taken element-wise. In momentum or adaptive variants, the fractional factor \(f_\nu\) multiplies the gradient before it enters the moment accumulators.
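The update above can be sketched in a few lines of plain Python. This is a minimal illustration of the formula, not the authors' PyTorch implementation: the function names (`fractional_factor`, `fsgd_step`, `fsgd_momentum_step`), the default `eps`, the restriction to \(0 < \nu < 2\), and the PyTorch-style momentum convention are all assumptions made for this example.

```python
import math


def fractional_factor(theta, nu, eps=1e-8):
    """f_nu(theta) = (|theta| + eps)^(1 - nu) / Gamma(2 - nu).

    Assumes 0 < nu < 2 so that Gamma(2 - nu) is finite and positive.
    """
    return (abs(theta) + eps) ** (1.0 - nu) / math.gamma(2.0 - nu)


def fsgd_step(params, grads, lr, nu, eps=1e-8):
    """One plain fractional step: theta <- theta - lr * g * f_nu(theta)."""
    return [p - lr * g * fractional_factor(p, nu, eps)
            for p, g in zip(params, grads)]


def fsgd_momentum_step(params, grads, velocity, lr, nu, momentum=0.9, eps=1e-8):
    """Momentum variant (assumed convention: v <- mu * v + g_scaled).

    The gradient is scaled by f_nu *before* it enters the velocity buffer.
    """
    new_velocity = [momentum * v + g * fractional_factor(p, nu, eps)
                    for p, g, v in zip(params, grads, velocity)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy loss E(theta) = 0.5 * sum(theta_i ** 2), whose gradient is theta itself.
params = [2.0, -0.5]
grads = list(params)

classical = fsgd_step(params, grads, lr=0.1, nu=1.0)   # factor is 1 at nu = 1
fractional = fsgd_step(params, grads, lr=0.1, nu=0.5)  # magnitude-dependent step
momentum_params, velocity = fsgd_momentum_step(
    params, grads, velocity=[0.0, 0.0], lr=0.1, nu=0.5
)
```

With \(\nu = 1\) the factor is exactly 1, so `classical` matches an ordinary gradient step. With \(\nu = 0.5\) the step grows with \(\sqrt{|\theta_t|}\), so parameters of larger magnitude move proportionally further than under the classical rule.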

Reference: Oscar Herrera-Alcantara, Josue R. Cervantes-Alonso, "Fractional Gradient Optimizers for PyTorch: Enhancing GAN and BERT", Fractal and Fractional 2023, 7(7), 500. https://doi.org/10.3390/fractalfract7070500
