DP-FedAdamW
Implements DP-FedAdamW, a differentially private AdamW variant for federated training of large models.
Each client clips per-sample gradients to norm \(C\), averages them over the local batch, and injects Gaussian noise to enforce \((\epsilon,\delta)\)-differential privacy. The noise inflates the second-moment estimate, so DP-FedAdamW debiases \(\hat v_t\) by subtracting the known noise variance before forming the adaptive denominator. A local-global alignment term \(\gamma\,\Delta G_t\) nudges each client toward the aggregated global descent direction, and weight decay is decoupled in the AdamW style.
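The clip, average, and noise step, together with the second-moment correction, can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the noise convention (Gaussian noise with standard deviation \(\sigma C\) added to the summed clipped gradients, then divided by the batch size, giving a per-coordinate variance of \((\sigma C / B)^2\)) is an assumption that may differ from the paper's exact scaling.

```python
import numpy as np

def privatize_gradients(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, average over the batch, add Gaussian noise.

    per_sample_grads: array of shape (batch, dim).
    Returns the noisy mean gradient and the per-coordinate noise variance.
    """
    batch = per_sample_grads.shape[0]
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale factor min(1, C / ||g||); the small constant avoids division by zero.
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_sample_grads * scale
    # Assumed convention: noise of std sigma*C on the sum, divided by the batch size.
    noise_std = noise_multiplier * clip_norm / batch
    noise = rng.normal(0.0, noise_std, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise, noise_std ** 2

def debiased_second_moment(v_hat, noise_var, floor=0.0):
    """Subtract the known noise variance from the second-moment estimate.

    The result is clamped at `floor` so the adaptive denominator stays real.
    """
    return np.maximum(v_hat - noise_var, floor)
```

The clamp in `debiased_second_moment` is a design choice of this sketch: a noisy estimate can fall below the noise variance on some coordinates, and a negative value under the square root would be undefined.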
Notation: \(g_{ij}\) is the per-sample gradient, \(C\) the clipping norm, \(\sigma\) the noise multiplier, \(sR\) the local batch size, \(\mathcal{N}\) the Gaussian noise, \(\beta_1,\beta_2\) the moment decay rates, \(\epsilon\) the numerical stability constant (not the privacy budget \(\epsilon\) above), \(\lambda\) the decoupled weight decay, \(\eta\) the learning rate, and \(\gamma\) the alignment weight. \(\Delta G_t = -\tfrac{1}{SK\eta}\sum_i(\theta_i^{t,K}-\theta_i^{t,0})\) is the empirical global descent estimate, where the sum runs over the \(S\) participating clients and \(K\) is the number of local steps.
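To make the roles of these symbols concrete, here is a hedged NumPy sketch of the global descent estimate and of one local update. The function names are hypothetical, and several details are assumptions rather than the paper's stated rule: \(S\) is read as the number of clients and \(K\) as the number of local steps, standard Adam bias correction is applied, and the alignment term is added to the adaptive direction before the decoupled weight decay.

```python
import numpy as np

def global_descent_estimate(theta_start, theta_end, local_steps, lr):
    """Delta G_t = -(1 / (S * K * lr)) * sum_i (theta_end[i] - theta_start[i]).

    theta_start, theta_end: arrays of shape (S, dim), one row per client.
    """
    num_clients = theta_start.shape[0]
    displacement = (theta_end - theta_start).sum(axis=0)
    return -displacement / (num_clients * local_steps * lr)

def local_step(theta, m, v, noisy_grad, delta_g, step, *, lr, beta1, beta2,
               eps, weight_decay, gamma, noise_var):
    """One local update on a privatized gradient (assumed form, see lead-in)."""
    # Exponential moving averages of the noisy gradient and its square.
    m = beta1 * m + (1 - beta1) * noisy_grad
    v = beta2 * v + (1 - beta2) * noisy_grad ** 2
    # Standard Adam bias correction; step counts from 1.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Remove the known DP noise variance, clamped at zero.
    v_debiased = np.maximum(v_hat - noise_var, 0.0)
    # Adaptive direction plus the local-global alignment term.
    direction = m_hat / (np.sqrt(v_debiased) + eps) + gamma * delta_g
    # Decoupled (AdamW-style) weight decay is applied to theta directly.
    theta = theta - lr * (direction + weight_decay * theta)
    return theta, m, v
```

Because each client's displacement over \(K\) steps is roughly \(-K\eta\) times its average update direction, the sign and normalization in `global_descent_estimate` make \(\Delta G_t\) behave like an averaged gradient, which is why the sketch adds \(\gamma\,\Delta G_t\) to the direction that is then subtracted.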
Reference: Jin Liu, Yinbin Miao, Ning Xi, Junkang Liu, "DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models", arXiv 2026. https://arxiv.org/abs/2602.19945