LoRDO
Implements LoRDO, distributed low-rank optimization with infrequent communication.
Each worker keeps Adam-style moments in a low-rank subspace defined by a projection \(Q_t\), so per-worker optimizer memory drops from \(O(pq)\) to \(O(r(p+q))\) for a \(p \times q\) parameter matrix and rank \(r\). Gradients are clipped and corrected with error feedback before projection. Workers communicate only every \(K_x\) steps, at which point the averaged update direction is used to recompute the shared projection. To stop the iterates from stalling inside a fixed low-rank subspace, a full-rank quasi-hyperbolic term mixes the raw (full-rank) gradient with the low-rank momentum, which keeps the aggregated pseudo-gradient full-rank.
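To make the memory claim concrete, the following sketch counts optimizer-state entries for a single \(p \times q\) weight matrix. The storage layout is an assumption, not something this page specifies: the low-rank count stores the projection as a \(p \times r\) matrix plus two \(r \times q\) moment buffers, and it leaves out any error-feedback buffer. The function names are hypothetical.

```python
def adam_state_floats(p: int, q: int) -> int:
    """Full-rank Adam keeps two p x q moment buffers."""
    return 2 * p * q


def low_rank_state_floats(p: int, q: int, r: int) -> int:
    """Assumed layout: projection Q (p x r) plus two r x q moment buffers."""
    return p * r + 2 * r * q


# Example sizes chosen for illustration only.
p, q, r = 4096, 4096, 128
full = adam_state_floats(p, q)
low = low_rank_state_floats(p, q, r)
print(f"full-rank: {full} floats, low-rank: {low} floats, ratio: {full / low:.1f}x")
```

With these illustrative sizes the low-rank state is about 21 times smaller, and the gap widens as \(r\) shrinks relative to \(p\) and \(q\).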
For worker \(m\) at step \(t\), the local update (full-rank quasi-hyperbolic variant) is
and every \(K_x\) steps the workers synchronize and refresh the projection:
where \(\theta\) are parameters, \(\eta_t\) the learning rate, \(\hat{G}^m_t\) the clipped full-rank gradient and \(\hat{g}^m_t\) its low-rank projection, \(E^m_t\) the projection error feedback, \(u^m_t,v^m_t\) the low-rank first/second moments with decays \(\beta_1,\beta_2\), \(\epsilon\) a stability constant, \(\rho\) the clipping threshold, \(\omega\) the quasi-hyperbolic mixing weight, \(\mu\) a scaling factor, \(Q_t\) the rank-\(r\) projection, and \(K_x\) the communication interval.
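As a rough illustration of how the symbols above fit together, here is a minimal NumPy sketch of one local step and one projection refresh. It is one plausible reading of the description, not the paper's exact update. Several choices are assumptions: clipping uses the Frobenius norm, the moments have no bias correction, the quasi-hyperbolic term is written as \(\omega\) times the low-rank direction plus \((1-\omega)\mu\) times the clipped gradient, and the projection is refreshed from the top-\(r\) left singular vectors of the averaged direction. All function names are hypothetical.

```python
import numpy as np


def clip_by_norm(G, rho):
    """Scale G so that its Frobenius norm is at most rho (assumed clipping rule)."""
    norm = np.linalg.norm(G)
    return G if norm <= rho else G * (rho / norm)


def local_step(theta, G, Q, E, u, v, *, lr, beta1, beta2, eps, rho, omega, mu):
    """One local update for a single worker; Q has orthonormal columns (p x r)."""
    G_hat = clip_by_norm(G, rho)            # clipped full-rank gradient
    A = G_hat + E                           # add back what the last projection dropped
    g_hat = Q.T @ A                         # low-rank projection, shape (r, q)
    E = A - Q @ g_hat                       # new projection error feedback
    u = beta1 * u + (1 - beta1) * g_hat     # low-rank first moment
    v = beta2 * v + (1 - beta2) * g_hat**2  # low-rank second moment
    low_rank_dir = Q @ (u / (np.sqrt(v) + eps))
    # Assumed mixing form: the full-rank term keeps the update out of span(Q).
    update = omega * low_rank_dir + (1 - omega) * mu * G_hat
    return theta - lr * update, E, u, v


def refresh_projection(avg_direction, r):
    """Assumed refresh rule: top-r left singular vectors of the averaged direction."""
    U, _, _ = np.linalg.svd(avg_direction, full_matrices=False)
    return U[:, :r]


# Tiny single-worker demo with random data.
rng = np.random.default_rng(0)
p, q, r = 8, 6, 2
Q = refresh_projection(rng.standard_normal((p, q)), r)
theta = rng.standard_normal((p, q))
E, u, v = np.zeros((p, q)), np.zeros((r, q)), np.zeros((r, q))
G = rng.standard_normal((p, q))
theta_new, E, u, v = local_step(
    theta, G, Q, E, u, v,
    lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, rho=1.0, omega=0.9, mu=1.0,
)
print("update rank:", np.linalg.matrix_rank(theta - theta_new))
```

In a multi-worker run, each worker would call `local_step` for \(K_x\) iterations, after which the averaged direction would be passed to `refresh_projection`. How the moments are carried across a change of \(Q_t\) is not covered by this sketch.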
Reference: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane, "LoRDO: Distributed Low-Rank Optimization with Infrequent Communication", ICML 2026. https://arxiv.org/abs/2602.04396