FGD-ED

Implements FGD-ED, a power-function fractional-order gradient descent that decays the fractional-order term exponentially during training.

Fractional-order gradient descent replaces the integer-order gradient with a Caputo fractional derivative of the loss, evaluated with the previous iterate as the lower limit. For the power-function form this yields the ordinary gradient scaled by \(1/\Gamma(2-\alpha)\) and by the magnitude of the previous step raised to \(1-\alpha\), which injects a power-law memory of past updates. In practice this memory term causes severe oscillation and gradient explosion late in training.
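To get a feel for how strongly the fractional term rescales the gradient, the factor \(|\theta_t - \theta_{t-1}|^{1-\alpha} / \Gamma(2-\alpha)\) can be evaluated directly. The snippet below is a minimal sketch, not part of this package: the helper name `fractional_scale` and the sample step magnitude are assumptions chosen for illustration.

```python
import math


def fractional_scale(step_mag: float, alpha: float) -> float:
    """Multiplier applied to the ordinary gradient by the power-function form.

    Hypothetical helper: step_mag stands for |theta_t - theta_{t-1}| and
    alpha for the fractional order.
    """
    return step_mag ** (1.0 - alpha) / math.gamma(2.0 - alpha)


# Illustrative values only: the same previous-step magnitude at several orders.
for alpha in (0.5, 0.9, 0.99):
    print(alpha, fractional_scale(0.25, alpha))
```

As \(\alpha \to 1\) the exponent \(1-\alpha\) goes to zero and \(\Gamma(2-\alpha) \to \Gamma(1) = 1\), so for a nonzero previous step the multiplier approaches 1 and the ordinary gradient is recovered.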

FGD-ED refines the differentiation formula and multiplies the fractional-order term by an exponentially decaying coefficient \(e^{-\kappa t}\), so the update behaves fractionally early on (fast initial convergence) and relaxes toward the ordinary gradient step as training proceeds (oscillation suppression):

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \gamma\, \frac{g_t}{\Gamma(2-\alpha)}\, \rho_t\, |\theta_t - \theta_{t-1}|^{\,1-\alpha} \\ \rho_t &= e^{-\kappa t} \end{aligned} \]

where \(\theta\) are the parameters, \(\gamma\) the learning rate, \(g_t\) the gradient, \(\alpha \in (0,1)\) the fractional order, \(\Gamma\) the gamma function, \(\rho_t\) the exponential decay coefficient with rate \(\kappa > 0\), and \(|\theta_t - \theta_{t-1}|\) the magnitude of the previous step (the Caputo lower-limit offset). As \(\rho_t \to 0\) the power-law memory term vanishes and the rule reduces to ordinary gradient descent.
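The update above can be sketched in a few lines of NumPy. This is an illustrative transcription under stated assumptions, not the implementation shipped here: the function name `fgd_ed_step`, the default hyperparameters, the elementwise treatment of \(|\theta_t - \theta_{t-1}|\), and the small `eps` added so the first step is not exactly zero are all assumptions. It transcribes the displayed formula literally, so the whole step is scaled by \(\rho_t\); any plain-gradient term that the reference method retains is not modeled.

```python
import math

import numpy as np


def fgd_ed_step(theta, theta_prev, grad, t,
                lr=0.1, alpha=0.9, kappa=0.01, eps=1e-8):
    """One step of the displayed FGD-ED rule, applied elementwise.

    Hypothetical sketch: eps is an added assumption that keeps the memory
    term nonzero when theta == theta_prev (for example at the first step).
    """
    rho = math.exp(-kappa * t)                              # rho_t = exp(-kappa * t)
    memory = (np.abs(theta - theta_prev) + eps) ** (1.0 - alpha)
    return theta - lr * grad / math.gamma(2.0 - alpha) * rho * memory


# Toy problem (assumption): f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta_prev = np.array([1.0, -2.0])
theta = np.array([0.9, -1.8])
for t in range(50):
    # The right-hand side is evaluated first, so theta_prev receives the old theta.
    theta, theta_prev = fgd_ed_step(theta, theta_prev, grad=theta, t=t), theta

print(theta)
```

Keeping `theta_prev` alongside `theta` is the only extra state the rule needs beyond the step counter `t`, which is why the sketch passes both explicitly instead of hiding them in an optimizer object.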

Reference: Xiaojun Zhou, et al., "Fractional-order gradient descent method based on fractional-order term exponential decay and its application in artificial neural networks", Information Processing & Management 2026. https://doi.org/10.1016/j.ipm.2025.104448
