ProjFactor (VLoRP)¶
Implements ProjFactor, the optimizer for Various-Grained Low-Rank Projection (VLoRP) that keeps a projected first moment and an Adafactor-style factored second moment in the subspace.
VLoRP reshapes the gradient of a weight matrix \(W \in \mathbb{R}^{n \times m}\) into \([nc,\, m/c]\) and right-multiplies it by a random projection \(\tilde P \in \mathbb{R}^{(m/c) \times r}\) with entries drawn from \(\mathcal N(0, 1/r)\), trading off rank \(r\) against granularity \(c\) at a fixed compression budget. ProjFactor stores the momentum \(\tilde m^s\) in the low-rank subspace and tracks only the row and column sums of the squared back-projected gradient, so the optimizer state grows as \(O(n + m)\) rather than \(O(nm)\). At each step the momentum is projected back to the original space, divided by the rank-1 reconstruction of the second moment, reshaped, and applied to \(W\).
where \(\theta\) (the matrix \(W\)) are the parameters, \(\eta\) the learning rate, \(g_t\) the gradient, \(\tilde P\) the random down-projection of rank \(r\), \(\tilde m_t^s\) the first moment kept in the \(r\)-dimensional subspace, \(\tilde v_t^{ro}, \tilde v_t^{co}\) the row and column second-moment factors of the back-projected gradient \(\tilde G_t^s \tilde P^{\top}\), \(\hat v_t^o\) their rank-1 reconstruction, \(\beta_1, \beta_2\) the decay rates, \(\epsilon\) a stability constant, \(\odot 2\) elementwise squaring, \(\mathbf 1\) all-ones vectors, \(n, m\) the matrix dimensions, \(c\) the granularity, and \((1 - \beta_2^t)/(1 - \beta_1^t)\) the bias-correction scalar at step \(t\).
Reference: Yezhen Wang, Zhouhao Yang, Brian K. Chen, Fanyi Pu, Bo Li, Tianyu Gao, Kenji Kawaguchi, "Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients", ICML 2025. https://arxiv.org/abs/2505.01744