LD-SGD
Implements LD-SGD, local decentralized SGD that interleaves several local SGD steps with periodic gossip averaging.
In decentralized training, \(n\) workers each hold a local model and a private data shard, and they communicate only with neighbors through a mixing matrix instead of a central server. Fully decentralized SGD gossips after every gradient step, which is communication-heavy. LD-SGD reduces this cost by letting each worker run a block of local SGD updates and mixing the models only at the iterations in a communication set \(\mathcal{I}_T\). Choosing \(\mathcal{I}_T\) trades communication for accuracy: mixing every step recovers decentralized SGD, while sparser mixing approaches independent local training.
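As a small illustration of the trade-off, the sketch below builds a communication set for one simple schedule: mix once after every `period` gradient steps. The helper name `periodic_comm_set` and the 0-based iteration indexing are assumptions made for this example; they are not taken from the reference.

```python
def periodic_comm_set(num_steps, period):
    """Hypothetical helper: return the iterations t in [0, num_steps) at which
    the workers mix, here one mixing step after every `period` gradient steps."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return {t for t in range(num_steps) if (t + 1) % period == 0}


T = 12
every_step = periodic_comm_set(T, 1)    # mix at every iteration: decentralized SGD
every_fourth = periodic_comm_set(T, 4)  # mix at t = 3, 7, 11 only
no_mixing = set()                       # never mix: independent local training

# Number of communication rounds used by each schedule.
rounds = {
    "every_step": len(every_step),
    "every_fourth": len(every_fourth),
    "no_mixing": len(no_mixing),
}
```

With `T = 12`, the three schedules use 12, 3, and 0 communication rounds respectively, which is the cost side of the trade-off described above.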
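Stacking the workers' parameters into the columns of \(X_t \in \mathbb{R}^{d\times n}\) and their stochastic gradients into \(G(X_t;\xi_t)\), one iteration is
\[
X_{t+1} = \bigl(X_t - \eta\, G(X_t;\xi_t)\bigr)\, W_t,
\qquad
W_t = \begin{cases} I_n, & t \notin \mathcal{I}_T, \\ W, & t \in \mathcal{I}_T, \end{cases}
\]
or, written per worker at a communication iteration,
\[
x_{t+1/2}^{(k)} = x_t^{(k)} - \eta\, \nabla F_k\bigl(x_t^{(k)};\xi_t^{(k)}\bigr),
\qquad
x_{t+1}^{(k)} = \sum_{l \in \mathcal{N}_k \cup \{k\}} w_{kl}\, x_{t+1/2}^{(l)},
\]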
where \(x_t^{(k)}\) is worker \(k\)'s model, \(\eta\) the learning rate, \(\nabla F_k(\cdot;\xi^{(k)})\) its local stochastic gradient, \(\mathcal{N}_k\) the neighbors of \(k\), and \(W=[w_{kl}]\) a symmetric doubly stochastic mixing matrix (\(W=W^\top\), \(W\mathbf{1}_n=\mathbf{1}_n\), \(w_{kl}\ge 0\)). At local iterations (\(t \notin \mathcal{I}_T\)) \(W_t=I_n\), so no communication occurs and each worker takes a plain SGD step; at communication iterations (\(t \in \mathcal{I}_T\)) the half-step models are mixed through \(W\).
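The update can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not this module's implementation: the function names `ring_mixing_matrix` and `ld_sgd` are hypothetical, the local objectives are toy quadratics \(F_k(x) = \tfrac{1}{2}\|x - c_k\|^2\), exact gradients stand in for stochastic ones, and the topology is a ring with weight 1/3 on each worker and its two neighbors.

```python
def ring_mixing_matrix(n):
    """Symmetric doubly stochastic W for a ring of n workers (assumed topology):
    weight 1/3 on the worker itself and on each of its two neighbors."""
    W = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in (k - 1, k, k + 1):
            W[k][l % n] += 1.0 / 3.0
    return W


def ld_sgd(targets, W, lr, num_steps, comm_set):
    """Run the LD-SGD recursion on toy objectives F_k(x) = 0.5 * ||x - targets[k]||^2.

    X[k] plays the role of worker k's model x_t^{(k)} (column k of X_t).
    """
    n, d = len(targets), len(targets[0])
    X = [[0.0] * d for _ in range(n)]
    for t in range(num_steps):
        # Local half step: x_{t+1/2}^{(k)} = x_t^{(k)} - lr * grad F_k(x_t^{(k)}).
        half = [
            [X[k][j] - lr * (X[k][j] - targets[k][j]) for j in range(d)]
            for k in range(n)
        ]
        if t in comm_set:
            # Communication iteration (W_t = W): weighted average over neighbors.
            X = [
                [sum(W[k][l] * half[l][j] for l in range(n)) for j in range(d)]
                for k in range(n)
            ]
        else:
            # Local iteration (W_t = I_n): keep the half-step models unchanged.
            X = half
    return X


targets = [[1.0], [2.0], [3.0], [6.0]]  # c_k for four workers, d = 1
W = ring_mixing_matrix(len(targets))
T = 400
comm_set = {t for t in range(T) if (t + 1) % 5 == 0}  # mix after every 5 steps
models = ld_sgd(targets, W, lr=0.1, num_steps=T, comm_set=comm_set)
average = sum(m[0] for m in models) / len(models)
```

Because \(W\) is doubly stochastic, mixing leaves the average of the workers' models unchanged, so in this toy problem `average` approaches the mean of the targets (3.0) regardless of the schedule. Passing `comm_set=set()` instead lets each worker converge to its own target, which is the independent-training extreme.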
Reference: Xiang Li, Wenhao Yang, Shusen Wang, Zhihua Zhang, "Communication-Efficient Local Decentralized SGD Methods", arXiv 2019. https://arxiv.org/abs/1910.09126