LoRA-Pre
Implements LoRA-Pre, a memory-efficient Adam variant that stores momentum as a low-rank product.
The key observation is that an exponential moving average of the gradient is equivalent to fitting an online linear regressor, so the momentum matrix can be carried in a compact factorized form \(m_B m_A\) instead of storing the dense moment. Each step updates the low-rank factors with one regression step, reconstructs the moment, and then applies a standard Adam update with bias correction and decoupled weight decay. The same factorization is applied to the (elementwise-squared) second moment \(v_B v_A\).
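The equivalence between an exponential moving average and an online regression step can be checked numerically. The sketch below is illustrative only and is not this optimizer's implementation. It assumes the regressor loss is \(\tfrac{1}{2}\lVert m - g\rVert^2\) and uses hypothetical helper names (`ema_step`, `regression_step`, `state_entries`). It also counts the stored entries for a dense moment versus rank-\(r\) factors, assuming \(m_B\) has shape rows × \(r\) and \(m_A\) has shape \(r\) × cols.

```python
def ema_step(m, g, beta):
    """Classic EMA on flat lists: m <- beta * m + (1 - beta) * g."""
    return [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]


def regression_step(m, g, lr):
    """One gradient step on 0.5 * ||m - g||^2, whose gradient w.r.t. m is (m - g)."""
    return [mi - lr * (mi - gi) for mi, gi in zip(m, g)]


def state_entries(rows, cols, rank):
    """Entries stored for a dense moment vs. two rank-`rank` factors (assumed shapes)."""
    dense = rows * cols
    factored = rank * (rows + cols)
    return dense, factored


beta = 0.9
m = [0.5, -1.0, 2.0]   # previous moment (toy values)
g = [1.0, 0.25, -3.0]  # current gradient (toy values)

ema = ema_step(m, g, beta)
reg = regression_step(m, g, lr=1.0 - beta)  # step size 1 - beta reproduces the EMA
max_gap = max(abs(a - b) for a, b in zip(ema, reg))

dense, factored = state_entries(rows=4096, cols=4096, rank=64)
print(max_gap, dense, factored)
```

With these toy shapes the factored state holds `rank * (rows + cols)` entries instead of `rows * cols`, which is where the memory saving comes from when the rank is small relative to the matrix dimensions.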
Notation: \(\theta\) denotes the parameters, \(\gamma\) the learning rate, \(g_t\) the gradient at step \(t\), \(m_B, m_A\) and \(v_B, v_A\) the rank-\(r\) factors of the first and second moments, \(\beta_1, \beta_2\) the Adam decay rates, \(\lambda\) the weight decay, \(\epsilon\) a stability constant, and \((\cdot)^{\odot 2}\) elementwise squaring. The regressor (EMA) rates \(\gamma_1, \gamma_2\) are chosen so that \(1-\gamma_1=\sqrt{\beta_1}\) and \(1-\gamma_2=\beta_2^{1/4}\).
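The rate relations can be made concrete with a short calculation. The sketch below uses a hypothetical helper, `regressor_rates`, that is not part of any documented API. It rests on one reading of the relations, stated here as an assumption: each factor retains a \((1-\gamma)\) share of its previous value per step, and the second moment is reconstructed by squaring the factor product elementwise. Under that reading, the two-factor product \(m_B m_A\) decays by \((1-\gamma_1)^2=\beta_1\), and the squared product decays by \((1-\gamma_2)^4=\beta_2\).

```python
import math


def regressor_rates(beta1, beta2):
    """Return (gamma1, gamma2) with 1 - gamma1 = sqrt(beta1) and 1 - gamma2 = beta2 ** 0.25."""
    gamma1 = 1.0 - math.sqrt(beta1)
    gamma2 = 1.0 - beta2 ** 0.25
    return gamma1, gamma2


beta1, beta2 = 0.9, 0.999  # common Adam defaults, used here only as an example
gamma1, gamma2 = regressor_rates(beta1, beta2)

# Assumption: each factor is scaled by (1 - gamma) per step.
first_moment_decay = (1.0 - gamma1) ** 2   # two factors in the product m_B m_A
second_moment_decay = (1.0 - gamma2) ** 4  # two factors, then elementwise squaring

print(gamma1, gamma2)
print(first_moment_decay, second_moment_decay)
```

Choosing the rates this way means the reconstructed moments would forget old gradients at the same speed as the dense Adam moments with decay rates \(\beta_1\) and \(\beta_2\).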
Reference: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan, "Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation", arXiv 2025. https://arxiv.org/abs/2602.24283