GASLoC
Implements GASLoC, a decentralized low-communication method that fuses gossip averaging with the outer optimizer of local-update training.
GASLoC runs each worker \(i\) through \(H_i\) local inner steps, then forms a pseudo-gradient \(g_t^i\) equal to the drift of its parameters over the inner phase. Instead of a global all-reduce, workers exchange parameters only with sparse neighbors in a communication graph: the outer step applies the proposed update \(\theta_t + \eta g_t\) and then subtracts a gossip term built from the weighted graph Laplacian \(\Lambda\), so each worker is pulled toward its peers. This generalizes the DiLoCo outer optimizer to arbitrary topologies and supports randomized one- or two-peer matchings.
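The outer step described above can be sketched in NumPy for a randomized one-peer matching. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names (`inner_phase`, `random_matching`, `outer_step`), the toy quadratic loss, unit edge weights, and the choice to apply the gossip term to the proposed iterates \(\theta_t + \eta g_t\) are all assumptions made for this example.

```python
import numpy as np


def inner_phase(theta_i, target_i, beta, steps):
    """Toy inner loop (hypothetical): gradient steps on 0.5 * ||x - target_i||^2."""
    x = theta_i.copy()
    for _ in range(steps):
        x -= beta * (x - target_i)  # beta plays the role of the inner learning rate
    return x


def random_matching(n_workers, rng):
    """Random one-peer matching; with an odd count one worker stays unmatched."""
    perm = rng.permutation(n_workers)
    return [(int(perm[k]), int(perm[k + 1])) for k in range(0, n_workers - 1, 2)]


def outer_step(theta, theta_local, eta, alpha, edges):
    """Propose theta + eta * g, then subtract a gossip term along `edges`.

    Assumption: unit edge weights, gossip acting on the proposed iterates.
    """
    g = theta_local - theta           # pseudo-gradient: drift over the inner phase
    proposed = theta + eta * g
    new_theta = proposed.copy()
    for i, j in edges:
        diff = proposed[i] - proposed[j]
        new_theta[i] -= alpha * diff  # pull worker i toward worker j
        new_theta[j] += alpha * diff  # pull worker j toward worker i
    return proposed, new_theta


rng = np.random.default_rng(0)
n_workers, dim = 4, 3
theta = np.tile(rng.normal(size=dim), (n_workers, 1))  # all workers start in sync
targets = rng.normal(size=(n_workers, dim))            # per-worker toy objectives
local_steps = [2, 4, 3, 5]                             # H_i may differ per worker

theta_local = np.stack([
    inner_phase(theta[i], targets[i], beta=0.1, steps=local_steps[i])
    for i in range(n_workers)
])
edges = random_matching(n_workers, rng)
proposed, theta_next = outer_step(theta, theta_local, eta=1.0, alpha=0.5, edges=edges)
```

With `alpha=0.5` and unit weights, each matched pair lands on its pairwise average, and the mean over workers is the same before and after the gossip term. Smaller values of `alpha` move the pair only part of the way toward each other.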
An accelerated variant adds a momentum term on the post-local-update iterates, which provably improves the dependence on the spectral gap \(\chi\) of the gossip matrix:
where \(\theta_t^i\) are worker \(i\)'s parameters at outer step \(t\), \(g_t^i\) is its pseudo-gradient (the inner-phase drift), \(\beta\) is the inner learning rate, \(\eta\) is the outer learning rate, \(H_i\) is the number of local steps on worker \(i\), \(\Lambda = \tfrac{1}{2}\sum_{(i,j)\in\mathcal{E}}\lambda_{ij}(e_i-e_j)(e_i-e_j)^\top\) is the weighted graph Laplacian of the communication graph, \(\alpha\) is the gossip step size, and \(\gamma\) is the momentum parameter applied to the post-update iterates (\(\gamma = 0\) recovers the non-accelerated form). Stacked over workers, \(\theta_t\) and \(g_t\) collect all \(\theta_t^i\) and \(g_t^i\).
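The Laplacian definition above can be checked numerically. The sketch below builds \(\Lambda\) for a four-worker ring and applies one plain gossip step to stacked parameters. It assumes that \(\mathcal{E}\) lists each edge in both orientations (which is one way to read the factor \(\tfrac{1}{2}\)), and the function name `graph_laplacian`, the ring topology, and the value of `alpha` are illustrative choices rather than part of the method.

```python
import numpy as np


def graph_laplacian(n_workers, weights):
    """Build Lambda = 1/2 * sum_{(i,j)} lambda_ij (e_i - e_j)(e_i - e_j)^T.

    `weights` maps ordered pairs (i, j) to lambda_ij. Assumption: both
    (i, j) and (j, i) are present, so the 1/2 undoes the double count.
    """
    lap = np.zeros((n_workers, n_workers))
    for (i, j), lam in weights.items():
        e = np.zeros(n_workers)
        e[i], e[j] = 1.0, -1.0            # e_i - e_j
        lap += 0.5 * lam * np.outer(e, e)
    return lap


# Ring of 4 workers with unit weights, both orientations listed.
n = 4
weights = {}
for i in range(n):
    j = (i + 1) % n
    weights[(i, j)] = 1.0
    weights[(j, i)] = 1.0
lap = graph_laplacian(n, weights)

# One gossip step on stacked parameters (one row per worker).
rng = np.random.default_rng(0)
theta = rng.normal(size=(n, 3))
alpha = 0.25
theta_next = theta - alpha * lap @ theta


def disagreement(x):
    """Distance of the stacked parameters from their across-worker mean."""
    return np.linalg.norm(x - x.mean(axis=0))
```

Because every row of \(\Lambda\) sums to zero, the gossip term leaves the average over workers unchanged and only shrinks the disagreement between them, so `disagreement(theta_next)` is smaller than `disagreement(theta)` in this example.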
Reference: Pietro Cagnasso, Eugene Belilovsky, Edouard Oyallon, "Unifying Local Communications and Local Updates for LLM Pretraining", arXiv 2026. https://arxiv.org/abs/2606.11081