TP-TopK¶
Implements TP-TopK, a two-phase coordinate-sparse DP-SGD that restricts both optimization and noise injection to a learned coordinate support.
Standard DP-SGD injects isotropic Gaussian noise into every coordinate, so the optimizer-facing noise energy grows with the ambient dimension \(d\). TP-TopK first runs a short full-parameter DP warm-up that scores each coordinate by its noise-corrected squared gradient signal, selects the top \(k = \lfloor \rho d \rfloor\) coordinates as the active support \(A\), and then runs masked DP-SGD restricted to \(A\). Because masking precedes clipping, per-example sensitivity stays at \(C_2\) regardless of \(k\), and the effective noise dimension drops from \(d\) to \(k \ll d\).
In Phase 1, per-example gradients are clipped to \(C_1\), averaged with Gaussian noise, used to update \(\theta\), and accumulated into a coordinate score that is denoised by subtracting the per-coordinate noise floor; the support is then chosen by Top-\(K\). In Phase 2, each per-example gradient is masked to \(A\), clipped to \(C_2\), averaged with noise injected only on the active coordinates, and applied to \(\theta\).
where \(\theta\) are the parameters, \(\eta_t\) the learning rate, \(g_{t,i}\) the per-example gradient, \(C_1,C_2\) the \(\ell_2\) clip norms, \(\sigma_1,\sigma_2\) the noise multipliers, \(B\) the batch size, \(\rho\) the active ratio, \(a\) the noise-corrected coordinate score, \(A\) the active support of size \(k\), \(m\in\{0,1\}^d\) the support mask, and \(\odot\) elementwise product. Phase 1 runs for \(T_1\) steps to build the score; Phase 2 runs for \(T_2\) steps with the support held fixed.
Reference: Huiqi Zhang, Fang Xie, "When Do Fewer Coordinates Suffice in DP-SGD?", arXiv 2026. https://arxiv.org/abs/2606.04375