PowerStep
Implements PowerStep, a memory-efficient optimizer that applies a signed power transform to a heavy-ball momentum buffer.
PowerStep avoids Adam's per-coordinate second-moment buffer by deriving coordinate-wise adaptivity from an \(\ell_p\)-norm steepest-descent view. It first accumulates gradients into a heavy-ball momentum buffer for temporal smoothing, then applies a signed power transform to that buffer, which compresses large-magnitude coordinates and amplifies small-magnitude ones while preserving their signs. This yields adaptive-style behavior with a single state buffer, halving optimizer-state memory relative to Adam, which keeps two. Decoupled weight decay is applied directly in the parameter update.
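A sketch of the update rule, reconstructed from the description above and the symbol list that follows. The exact form is an assumption, not taken from the reference: in particular, the unnormalized heavy-ball accumulation \(m_t = \gamma m_{t-1} + g_t\) (rather than an exponential moving average) and the placement of the weight-decay term inside the step are guesses consistent with the text.

\[
\begin{aligned}
m_t &= \gamma\, m_{t-1} + g_t, \\
\theta_{t+1} &= \theta_t - \eta_t \left( \operatorname{sign}(m_t) \odot |m_t|^{\beta} + \lambda\, \theta_t \right),
\end{aligned}
\]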
where \(\theta\) are the parameters, \(g_t\) the gradient, \(m_t\) the heavy-ball momentum buffer, \(\gamma \in [0,1)\) the momentum coefficient, \(\beta \in [0,1]\) the power exponent applied elementwise, \(\eta_t\) the learning-rate schedule, \(\lambda \ge 0\) the decoupled weight decay, and \(\odot\) elementwise multiplication.
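The step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the function name `powerstep_step`, its argument names, and the default values are hypothetical, and the update form (unnormalized heavy-ball accumulation, weight decay added inside the learning-rate-scaled step) is an assumption read off the symbol definitions. Plain lists are used so the sketch has no dependencies.

```python
import math


def powerstep_step(theta, grad, m, lr, gamma=0.9, beta=0.5, weight_decay=0.0):
    """One PowerStep-style update on flat lists of floats (hypothetical helper).

    Assumed form: m <- gamma * m + g, then
    theta <- theta - lr * (sign(m) * |m|**beta + weight_decay * theta).
    """
    new_theta, new_m = [], []
    for th, g, mi in zip(theta, grad, m):
        # Heavy-ball momentum: the only optimizer state kept per coordinate.
        mi = gamma * mi + g
        # Signed power transform; zero is handled explicitly so it stays zero
        # even when beta == 0.
        d = math.copysign(abs(mi) ** beta, mi) if mi != 0.0 else 0.0
        # Decoupled weight decay acts on the parameter, not on the gradient.
        new_theta.append(th - lr * (d + weight_decay * th))
        new_m.append(mi)
    return new_theta, new_m


# Example: one step from a zero momentum buffer.
theta0 = [1.0, -2.0, 0.5]
grad0 = [4.0, -0.25, 0.0]
m0 = [0.0, 0.0, 0.0]
theta1, m1 = powerstep_step(theta0, grad0, m0, lr=0.1, gamma=0.9, beta=0.5)
```

With \(\beta = 0.5\), the momentum entry of magnitude 4 is compressed to a step direction of magnitude 2, while the entry of magnitude 0.25 is amplified to 0.5. Under this assumed form, \(\beta = 1\) recovers heavy-ball SGD and \(\beta = 0\) reduces to a sign-of-momentum update.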
Reference: Yao Lu, Dengdong Fan, Shixun Zhang, Yonghong Tian, "PowerStep: Memory-Efficient Adaptive Optimization via \(\ell_p\)-Norm Steepest Descent", arXiv 2026. https://arxiv.org/abs/2605.10335