rTop-k¶
Implements rTop-k, a gradient sparsifier for distributed SGD that combines top-r selection with uniform random sampling.
In communication-efficient distributed training each worker compresses its gradient before transmission. Pure top-\(k\) is deterministic and biased toward large coordinates, while random-\(k\) is unbiased but ignores magnitude. rTop-k interpolates between the two: it first restricts attention to the \(r\) coordinates of largest magnitude, then keeps a uniformly random \(k\)-subset of those, zeroing the rest. With error feedback, the coordinates dropped at one step are accumulated into a memory term and reintroduced at the next, so no gradient information is permanently discarded.
where \(g_i^t\) is the local stochastic gradient at worker \(i\), \(m_i^t\) is its accumulated error-feedback memory, \(\theta\) are the model parameters, \(\eta\) is the learning rate, and \(n\) is the number of workers. The mask \(U\) is drawn uniformly from the \(k\)-element subsets of the indices of the \(r\) largest-magnitude coordinates of \(p_i^t\) (with \(1 \le k \le r \le d\)), so \(U\) is a random \(k\)-subset of the top-\(r\) support.
Reference: Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur, "rTop-k: A Statistical Estimation Approach to Distributed SGD", arXiv 2020. https://arxiv.org/abs/2005.10761