DECA¶
Implements DECA, a decentralized full-parameter fine-tuning method that runs block-wise Adam with consensus-corrected moment estimates.
DECA partitions the model into \(B\) disjoint blocks and, in each communication round, optimizes one block at a time with a server-free Adam variant. For an active block it runs an inner loop of \(R\) steps that combines two ingredients: standard Adam-style moments built from fresh local gradients, and block-wise moment approximations (BMAs) that fold in a consensus-derived discrepancy signal. After each local Adam step a client averages its block with its neighbors, and the resulting change — the difference between the local and aggregated model — is injected back into the first and second moments. This steers the next step toward network-wide agreement while keeping it aligned with the local objective, mitigating client drift on non-IID data.
For client \(i\) on active block \(k\), the inner step \(r\) (block/round indices dropped) is:
where \(\theta\) are the local block parameters, \(\gamma\) the learning rate, \(g_r\) the local block gradient on data \(D_i\), \(m\)/\(v\) the first- and second-order BMAs, \(\alpha_1,\alpha_2\) the local-gradient decay rates, \(\beta_1,\beta_2\) the consensus-signal decay rates, \(\epsilon\) the stability constant, \(N_i\) and \(w_{i,j}\) the neighborhood and mixing weights of client \(i\), and \(h_r\) the consensus-derived discrepancy signal between the local and aggregated block.
Reference: Yunsheng Yuan, Shaowei Li, Kai Wang, Zhongyuan Sun, Zheng Zhang, Kai Han, Jun Luo, Feng Li, "DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data", arXiv 2026. https://arxiv.org/abs/2606.03209