Fractional-order SGD (FSGD)

Implements Fractional-order SGD (FSGD), gradient descent driven by a Riemann-Liouville fractional-order gradient of each layer.

The paper derives a closed-form fractional-order derivative of an affine layer \(y = xw + b\) and uses it, through a fractional-order autograd, to produce a fractional-order weight gradient. Replacing the ordinary gradient in gradient descent (and in its variants, e.g. Adam) with this fractional-order gradient yields the corresponding fractional-order optimizers, FSGD and FAdam. For such a layer, the per-weight fractional derivative, the resulting weight gradient, and the update rule are

\[ \begin{aligned} \frac{\partial^{\alpha} y}{\partial w^{\alpha}} &= \frac{x}{\Gamma(2-\alpha)}\,\lvert w\rvert^{\,1-\alpha} + \mathrm{sign}(w)\,\frac{b}{\Gamma(1-\alpha)}\,\lvert w\rvert^{-\alpha}, \\ g_t^{(\alpha)} &= \Big(\tfrac{\partial^{\alpha} \mathbf{Y}}{\partial \mathbf{W}^{\alpha}}\Big)^{\!\top}\!\bullet \mathbf{G}_t, \\ \theta_{t+1} &= \theta_t - \eta\, g_t^{(\alpha)}, \end{aligned} \]

where \(\alpha \in (0,1]\) is the fractional order, \(\Gamma\) is the gamma function, \(x\) and \(b\) are the layer input and bias, and \(w\) is a single weight. The matrix \(\mathbf{G}_t\) is the back-propagated upstream gradient, and \(\bullet\) denotes the elementwise-then-contracted product of the fractional-derivative matrix with \(\mathbf{G}_t\), which yields the fractional-order weight gradient \(g_t^{(\alpha)}\). The learning rate is \(\eta\) and the parameters are \(\theta\). At \(\alpha = 1\) we have \(\Gamma(2-\alpha) = 1\), \(\lvert w\rvert^{1-\alpha} = 1\), and \(1/\Gamma(1-\alpha) \to 0\), so the derivative reduces to \(x\) and the rule reduces to ordinary SGD.
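The formulas above can be sketched in NumPy for a single affine layer \(\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}\). This is an illustrative sketch, not the paper's implementation. The function names `fractional_weight_grad` and `fsgd_step` are hypothetical, and so is the `eps` clamp that keeps \(\lvert w\rvert^{-\alpha}\) finite. The \(\bullet\) product is read here as "apply the per-weight derivative to every sample, multiply by the matching upstream gradient entry, and sum over the batch", which is one plausible interpretation rather than a confirmed one.

```python
import math

import numpy as np


def fractional_weight_grad(X, W, b, G, alpha, eps=1e-8):
    """Sketch of the fractional-order weight gradient g^(alpha).

    X: (n, d_in) layer inputs, W: (d_in, d_out) weights,
    b: (d_out,) bias, G: (n, d_out) upstream gradient dL/dY.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")

    # Assumption: clamp |w| away from zero so |w|**(-alpha) stays finite.
    abs_w = np.maximum(np.abs(W), eps)

    # Input term: x / Gamma(2 - alpha) * |w|^(1 - alpha), contracted over the batch.
    input_term = (X.T @ G) * abs_w ** (1.0 - alpha) / math.gamma(2.0 - alpha)

    if alpha == 1.0:
        # 1 / Gamma(1 - alpha) -> 0 as alpha -> 1 (Gamma has a pole at 0),
        # so the bias term vanishes and the ordinary gradient X^T G remains.
        return input_term

    # Bias term: sign(w) * b / Gamma(1 - alpha) * |w|^(-alpha), contracted over the batch.
    bias_term = (
        np.sign(W) * (b * G.sum(axis=0)) * abs_w ** (-alpha) / math.gamma(1.0 - alpha)
    )
    return input_term + bias_term


def fsgd_step(W, grad, lr):
    """Plain gradient-descent update, theta <- theta - eta * g^(alpha)."""
    return W - lr * grad
```

With `alpha=1.0` the function returns `X.T @ G`, so `fsgd_step` performs an ordinary SGD update. Smaller values of `alpha` rescale each entry by a power of \(\lvert w\rvert\) and add the bias-dependent term.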

Reference: Xiaojun Zhou, Chunna Zhao, Yaqun Huang, Chengli Zhou, Junjie Ye, Kemeng Xiang, "Fractional-order Jacobian Matrix Differentiation and Its Application in Artificial Neural Networks", arXiv 2025. https://arxiv.org/abs/2506.07408
