Momo
Implements MoMo, SGD with momentum and an adaptive Polyak step size.
In the update, \(f_t\) is the loss at step \(t\), \(f_*\) is the lower bound lb on
the loss, and lr sets the cap \(\eta\) on the adaptive step size.
With bias_correction=True, the averages start at zero and \(f_*\)
and \(\eta\) are rescaled by \(\rho_t = 1 - \beta^t\), where \(\beta\)
is the momentum coefficient. With weight_decay \(\lambda > 0\), the update
ends with a proximal division by \(1 + \eta\lambda\). use_fstar=True
estimates the lower bound online instead of keeping it fixed.
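As a rough illustration of the update described above, here is a minimal pure-Python sketch of one MoMo step on a list of floats. It is not this library's implementation. It assumes the simplest variant from the referenced paper: averages initialized from the first loss and gradient, no bias correction, no weight decay, and a fixed lower bound. The names momo_step and state are hypothetical.

```python
def momo_step(x, grad, loss, state, lr=1.0, beta=0.9, lb=0.0):
    """One simplified MoMo step (assumed variant: no weight decay, no bias correction).

    x, grad: lists of floats; loss: float; state: dict holding the running averages.
    Returns the new iterate and the step size that was actually used.
    """
    dot_gx = sum(g * xi for g, xi in zip(grad, x))
    if not state:
        # Assumption: initialize the averages from the first loss and gradient.
        state["f_bar"] = loss
        state["d"] = list(grad)
        state["gamma"] = dot_gx
    else:
        # Exponential moving averages of loss, gradient, and <grad, x>.
        state["f_bar"] = beta * state["f_bar"] + (1 - beta) * loss
        state["d"] = [beta * di + (1 - beta) * gi for di, gi in zip(state["d"], grad)]
        state["gamma"] = beta * state["gamma"] + (1 - beta) * dot_gx

    d = state["d"]
    d_norm_sq = sum(di * di for di in d)
    if d_norm_sq == 0.0:
        # Zero averaged gradient: nothing to do.
        return list(x), 0.0

    # Value at x of the averaged linear model of the loss.
    h = state["f_bar"] + sum(di * xi for di, xi in zip(d, x)) - state["gamma"]
    # Polyak-type step size, truncated at zero and capped by lr.
    tau = min(lr, max(h - lb, 0.0) / d_norm_sq)
    return [xi - tau * di for xi, di in zip(x, d)], tau


# Usage on f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
state = {}
x = [3.0, 4.0]
for _ in range(5):
    loss = 0.5 * sum(xi * xi for xi in x)
    x, tau = momo_step(x, list(x), loss, state)
```

Because lr only caps the step, a large value such as 1.0 lets the Polyak term choose the step size, while a small value makes the update behave like SGD with momentum at a fixed learning rate.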
Note: step needs the current loss value. Pass either a closure or, if the backward pass has already run, the loss tensor through the loss argument.
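The two calling conventions can be pictured with a small stand-in. The helper below is hypothetical and is not taken from the library. It only sketches how a step method might obtain the loss from either argument, and it assumes that exactly one of the two must be given.

```python
def resolve_loss(closure=None, loss=None):
    """Return the loss a step would use (hypothetical helper, not library code)."""
    # Assumption of this sketch: exactly one of the two arguments is provided.
    if (closure is None) == (loss is None):
        raise ValueError("pass exactly one of closure or loss")
    if closure is not None:
        # A closure re-evaluates the model; in a real training loop it would
        # also run the backward pass before returning the loss.
        return closure()
    # Otherwise the caller has already run the backward pass and hands over the loss.
    return loss


# Usage with plain floats standing in for loss tensors.
from_closure = resolve_loss(closure=lambda: 0.25)
from_value = resolve_loss(loss=0.25)
```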
Reference: Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower, "MoMo: Momentum Models for Adaptive Learning Rates", ICML 2024. https://arxiv.org/abs/2305.07583