PowerSGD¶
Implements PowerSGD, low-rank gradient compression via one step of power iteration for communication-efficient distributed SGD.
PowerSGD reshapes each gradient tensor into a matrix \(M\) and approximates it by a rank-\(r\) factorization \(\hat{M} = P Q^\top\), computed with a single subspace-iteration step that reuses the right factor \(Q\) across optimizer steps as a warm start. Only the low-rank factors are exchanged: \(P\) is aggregated across workers with all-reduce, then orthogonalized, and \(Q\) is recomputed and all-reduced. To stay unbiased over time, the compression residual is kept in an error buffer \(e_t\) that is added back to the gradient before the next compression.
where \(g_t\) is the local gradient reshaped to a matrix, \(e_t\) the error-feedback buffer, \(Q\) the right factor carried over between steps, \(\hat{P} Q^\top\) the decompressed (all-reduced) gradient, \(\lambda\) the momentum factor, \(\gamma\) the learning rate, and \(\theta\) the parameters.
Reference: Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi, "PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization", NeurIPS 2019. https://arxiv.org/abs/1905.13727