Clip21-SGD2M¶
Implements Clip21-SGD2M, distributed clipped SGD with double momentum and error feedback for differential privacy.
Naive per-step gradient clipping introduces a bias that prevents convergence to the exact solution. Clip21 removes this bias by clipping the difference between a momentum-smoothed gradient estimate and a running buffer, then feeding the clipped increment back into that buffer (error feedback). Clip21-SGD2M layers two momentum mechanisms on top: a client-side momentum \(\beta_1\) that smooths the stochastic gradient into \(v_t\), and a server-side momentum \(\beta_2\) that controls how fast the error-feedback buffer \(g_t\) tracks \(v_t\). Optional Gaussian noise \(\omega_t\) added to each clipped increment yields the differentially private variant.
For each worker \(i\) (gradients drawn from local data \(f_i\)), the per-iteration update is
where \(\theta_t\) are the model parameters, \(\gamma > 0\) the stepsize, \(\beta_1,\beta_2 \in (0,1]\) the client- and server-side momentum coefficients, \(g_t\) the aggregated error-feedback gradient estimate (with per-worker buffers \(g_{i,t}\), initialized \(g_0 = v_0 = 0\)), \(v_{i,t}\) the momentum-smoothed local gradient, \(\mathrm{clip}_\tau(x) = \min\{1, \tau/\lVert x\rVert\}\, x\) the clipping operator with threshold \(\tau > 0\), \(\omega_{i,t} \sim \mathcal{N}(0, \sigma_\omega^2 I)\) the optional DP noise (\(\sigma_\omega = 0\) in the non-private case), and \(n\) the number of workers.
Reference: Rustem Islamov, Samuel Horváth, Aurelien Lucchi, Peter Richtárik, Eduard Gorbunov, "Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy", arXiv 2025. https://arxiv.org/abs/2502.11682