DeCo-SGD
Implements DeCo-SGD, a distributed SGD method that jointly tunes the gradient-compression ratio and the update staleness for communication-efficient training.
DeCo-SGD builds on delayed-aggregation error-feedback SGD (DD-EF-SGD): each of the \(n\) workers compresses a stale gradient together with its accumulated compression residual, the residual is updated with the part that was dropped, and the global parameters move along the averaged compressed messages. A controller periodically re-selects the staleness \(\tau\) and the compression ratio \(\delta\) from monitored bandwidth and compute conditions, trading communication volume against convergence speed.
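Written out, the per-step update described above takes roughly the following form. This is a reconstruction from the prose and the symbol definitions on this page, not a verbatim copy of the paper's equations; in particular, applying \(\gamma\) after compression rather than before it is an assumption.

\[
\begin{aligned}
\Delta_t^i &= C_\delta\!\left(g_{t-\tau}^i + e_t^i\right) \\
e_{t+1}^i &= g_{t-\tau}^i + e_t^i - \Delta_t^i \\
\theta_{t+1} &= \theta_t - \frac{\gamma}{n} \sum_{i=1}^{n} \Delta_t^i
\end{aligned}
\]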
Here, \(\theta_t\) denotes the global parameters, \(\gamma\) is the learning rate, \(g_{t-\tau}^i\) is worker \(i\)'s gradient delayed by staleness \(\tau\), \(e_t^i\) is its error-feedback residual, \(C_\delta(\cdot)\) is a compression operator with ratio \(\delta \in (0,1]\), and \(\Delta_t^i\) is worker \(i\)'s compressed message, which is averaged over the \(n\) workers.
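The following is a minimal, single-process sketch of the delayed error-feedback loop with fixed \(\tau\) and \(\delta\), intended only to make the data flow concrete. Several things here are assumptions rather than part of this module: the names `top_k` and `dd_ef_sgd` are hypothetical, top-k sparsification stands in for the unspecified compressor \(C_\delta\), workers are simulated sequentially with plain Python lists, and the adaptive controller that re-selects \(\tau\) and \(\delta\) is omitted.

```python
import math
from collections import deque


def top_k(vec, delta):
    """Assumed compressor: keep the ceil(delta * d) largest-magnitude entries."""
    d = len(vec)
    k = max(1, math.ceil(delta * d))
    keep = sorted(range(d), key=lambda j: abs(vec[j]), reverse=True)[:k]
    out = [0.0] * d
    for j in keep:
        out[j] = vec[j]
    return out


def dd_ef_sgd(grad_fns, theta0, lr, tau, delta, steps):
    """Hypothetical sketch: error-feedback SGD with gradients delayed by tau steps.

    grad_fns holds one gradient function per simulated worker.
    """
    n, d = len(grad_fns), len(theta0)
    theta = list(theta0)
    errors = [[0.0] * d for _ in range(n)]   # residuals e_t^i
    queues = [deque() for _ in range(n)]     # gradients not yet applied

    for t in range(steps):
        # Every worker computes a gradient at the current parameters.
        for i, grad_fn in enumerate(grad_fns):
            queues[i].append(grad_fn(theta))
        if t < tau:
            continue  # warm-up: no gradient is tau steps old yet

        avg = [0.0] * d
        for i in range(n):
            g_stale = queues[i].popleft()    # gradient from step t - tau
            corrected = [g + e for g, e in zip(g_stale, errors[i])]
            msg = top_k(corrected, delta)    # compressed message
            # The residual keeps exactly what the compressor dropped.
            errors[i] = [c - m for c, m in zip(corrected, msg)]
            avg = [a + m / n for a, m in zip(avg, msg)]

        # Global step along the averaged compressed messages.
        theta = [p - lr * a for p, a in zip(theta, avg)]
    return theta


# Usage: two workers with quadratic losses 0.5 * ||theta - c_i||^2.
centers = [[1.0, 3.0], [3.0, 1.0]]
grad_fns = [lambda th, c=c: [p - q for p, q in zip(th, c)] for c in centers]
theta = dd_ef_sgd(grad_fns, [0.0, 0.0], lr=0.05, tau=2, delta=0.5, steps=600)
```

Setting `tau=0` and `delta=1.0` reduces the sketch to plain synchronous SGD on the averaged gradient, which is a convenient sanity check before enabling delay or compression.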
Reference: Rongwei Lu, Jingyan Jiang, Chunyang Li, Haotian Dong, Xingguang Wei, Delin Cai, Zhi Wang, "DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD", arXiv preprint 2025. https://arxiv.org/abs/2507.17346