LaProp¶
Implements LaProp, which separates momentum from adaptivity in Adam.
LaProp divides the gradient by the second-moment estimate before accumulating momentum, so the momentum buffer holds already-normalized steps rather than raw gradients.
\[
\begin{aligned}
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)
\frac{g_t}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon} \\
\theta_t &= \theta_{t-1} - \frac{\eta}{1 - \beta_1^t} \, m_t
\end{aligned}
\]
The bias-correction terms are tracked as exponential moving averages so that a learning rate that changes across steps is handled correctly.
Reference: Liu Ziyin, Zhikang T. Wang, Masahito Ueda, "LaProp: Separating Momentum and Adaptivity in Adam", arXiv 2020. https://arxiv.org/abs/2002.04839