LAGS-SGD
Implements LAGS-SGD, layer-wise adaptive gradient sparsification for communication-efficient distributed SGD.
In distributed data-parallel training, exchanging full gradients can dominate the communication cost. LAGS-SGD sparsifies the gradient of each layer independently with a per-layer Top-\(k\) operator, so layers with different sizes and communication-to-computation ratios can each use their own compression level. A layer's selected entries can be sent as soon as its backpropagation finishes, which overlaps communication with the backpropagation of the remaining layers.
To preserve accuracy under aggressive sparsification, each worker keeps a local residual (error feedback): the entries dropped by Top-\(k\) are accumulated and re-injected in later iterations. The server averages the sparse contributions from all \(P\) workers to update the parameters.
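The update rule itself is not shown here. The symbol definitions that follow suggest a per-layer update of roughly the form below. This is a reconstruction from those definitions, not a transcription of the paper's statement. It assumes that the residual starts at zero and that the learning rate is applied in the averaged parameter update.

\[
\begin{aligned}
a_t^{p,(l)} &= g_t^{p,(l)} + \epsilon_t^{p,(l)}, \\
\epsilon_{t+1}^{p,(l)} &= a_t^{p,(l)} - \mathrm{TopK}\left(a_t^{p,(l)}, k^{(l)}\right), \\
\theta_{t+1}^{(l)} &= \theta_t^{(l)} - \frac{\eta_t}{P} \sum_{p=1}^{P} \mathrm{TopK}\left(a_t^{p,(l)}, k^{(l)}\right),
\end{aligned}
\]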
where \(l\) indexes layers and \(p\) indexes the \(P\) workers, \(\theta^{(l)}\) are the layer-\(l\) parameters, \(\eta_t\) is the learning rate, \(g_t^{p,(l)}\) is worker \(p\)'s stochastic gradient for layer \(l\), \(\epsilon_t^{p,(l)}\) is its accumulated residual, \(a_t^{p,(l)}\) is the locally accumulated gradient, and \(\mathrm{TopK}(x, k^{(l)})\) keeps the \(k^{(l)}\) largest-magnitude entries of \(x\) and zeros the rest.
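The per-layer Top-\(k\) selection with error feedback described above can be sketched in NumPy as a single-process simulation. This is a minimal illustration, not this package's implementation. The names `top_k` and `lags_step` are hypothetical, the workers are simulated in a loop rather than communicating over a network, and the learning rate is assumed to be applied when the averaged sparse gradients update the parameters.

```python
import numpy as np


def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    flat = x.ravel()
    out = np.zeros_like(flat)
    k = min(k, flat.size)
    if k > 0:
        # argpartition places the k largest magnitudes in the last k slots.
        idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
        out[idx] = flat[idx]
    return out.reshape(x.shape)


def lags_step(params, grads, residuals, ks, lr):
    """One simulated step (hypothetical helper, not a library API).

    params:    list of per-layer parameter arrays
    grads:     grads[p][l] is worker p's gradient for layer l
    residuals: residuals[p][l] is worker p's residual; updated in place
    ks:        ks[l] is the number of entries kept for layer l
    lr:        learning rate for this step
    """
    num_workers = len(grads)
    for l, theta in enumerate(params):
        aggregate = np.zeros_like(theta)
        for p in range(num_workers):
            acc = grads[p][l] + residuals[p][l]   # re-inject dropped entries
            sparse = top_k(acc, ks[l])            # per-layer selection
            residuals[p][l] = acc - sparse        # keep what was dropped
            aggregate += sparse
        params[l] = theta - lr * aggregate / num_workers
    return params


# Two workers, one layer with four parameters, keeping one entry per worker.
params = [np.zeros(4)]
grads = [[np.array([1.0, -3.0, 0.5, 0.0])],
         [np.array([2.0, 0.5, 0.0, -1.0])]]
residuals = [[np.zeros(4)], [np.zeros(4)]]
params = lags_step(params, grads, residuals, ks=[1], lr=0.1)
```

In this toy run each worker sends only its single largest-magnitude entry. The other entries remain in `residuals` and are added to the next gradient, so they are delayed rather than discarded.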
Reference: Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu, "Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees", arXiv 2019. https://arxiv.org/abs/1911.08727