PadamP
Implements PadamP, partially adaptive Adam with projection-based decoupling for scale-invariant weights.
PadamP combines two ideas. From Padam it borrows partial adaptivity: the second-moment normalizer is raised to a power \(p \in (0, 1/2]\) rather than the usual \(1/2\), interpolating between SGD-with-momentum (\(p \to 0\)) and Adam (\(p = 1/2\)) to curb the generalization gap of fully adaptive methods. From AdamP it borrows a projection step that, when the parameter and gradient are nearly orthogonal, removes the component of the update lying along the parameter direction, preventing the effective step size from collapsing on scale-invariant weights.
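The update described above can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's verbatim statement: it omits bias correction and weight decay, it borrows the orthogonality test \(\cos(\theta_t, g_t) < \delta / \sqrt{d}\) from AdamP (with \(\cos\) the absolute cosine similarity and \(d\) the number of entries in \(\theta_t\)), and the symbol \(q_t\) for the final update is introduced here for illustration only.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
p_t &= \frac{m_t}{v_t^{\,p} + \epsilon} \\
q_t &= \begin{cases} \Pi_{\theta_t}(p_t) & \text{if } \cos(\theta_t, g_t) < \delta / \sqrt{d} \\ p_t & \text{otherwise} \end{cases} \\
\theta_{t+1} &= \theta_t - \eta_t\, q_t
\end{aligned}
\]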
Here \(g_t = \nabla_\theta f_t(\theta_t)\) is the gradient. The projection \(\Pi_{\theta_t}(p_t) = p_t - \langle \hat{\theta}_t, p_t \rangle\, \hat{\theta}_t\), with \(\hat{\theta}_t = \theta_t / \lVert \theta_t \rVert_2\), removes the radial component of the update direction \(p_t\) (not to be confused with the exponent \(p\)). The remaining symbols are the learning rate \(\eta_t\), the moment decay rates \(\beta_1, \beta_2\), the partial-adaptivity exponent \(p \in (0, 1/2]\), the projection threshold \(\delta\) (e.g. \(0.1\)), and a stability constant \(\epsilon\).
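To make the definitions concrete, here is a minimal pure-Python sketch of a single step on one flat parameter vector. It is illustrative rather than a reference implementation. The function names `project`, `abs_cosine`, and `padamp_step` are hypothetical, and the defaults other than `delta=0.1` are arbitrary. The sketch assumes no bias correction and no weight decay, and it assumes the AdamP-style test "absolute cosine similarity below `delta / sqrt(d)`" for deciding when to project.

```python
import math


def project(theta, u):
    """Remove from u its component along theta (the radial component)."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm == 0.0:
        return list(u)  # no direction to project against
    unit = [t / norm for t in theta]
    radial = sum(a * b for a, b in zip(unit, u))
    return [b - radial * a for a, b in zip(unit, u)]


def abs_cosine(a, b, eps=1e-8):
    """Absolute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (na * nb + eps)


def padamp_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                p=0.25, delta=0.1, eps=1e-8):
    """One hedged PadamP-style step; returns the new (theta, m, v)."""
    # Exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Partial adaptivity: normalize by v**p instead of sqrt(v).
    update = [mi / (vi ** p + eps) for mi, vi in zip(m, v)]
    # Assumed AdamP-style test: project only when theta and grad are
    # nearly orthogonal, which suggests a scale-invariant weight.
    if abs_cosine(theta, grad) < delta / math.sqrt(len(theta)):
        update = project(theta, update)
    theta = [ti - lr * ui for ti, ui in zip(theta, update)]
    return theta, m, v
```

When the gradient is orthogonal to the parameter, the projected update has no component along `theta`, so to first order the step does not change the parameter norm. When the two are well aligned, the test fails and the step reduces to a plain partially adaptive update.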
Reference: Yongqi Li, Xiaowei Zhang, "Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning", arXiv 2025. https://arxiv.org/abs/2503.10005