PAdam¶
Implements PAdam, partially adaptive momentum estimation.
PAdam interpolates between Adam and SGD with momentum by raising the second-moment denominator to a partial power \(p \in (0, 1/2]\). With \(p = 1/2\) the update is Adam; as \(p \to 0\) the adaptivity vanishes and the update approaches plain momentum, which lets PAdam use a larger base learning rate without the gradient explosion that small denominators cause.
This implementation applies Adam-style bias correction to the moments and raises the bias-corrected denominator to the partial power, so the effective step is \(\eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)^{2p}\), which equals \(\hat{v}_t^{\,p}\) up to the stabilizing \(\epsilon\).
Reference: Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu, "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks", IJCAI 2020. https://arxiv.org/abs/1806.06763