DP-Muon
Implements DP-Muon, a differentially private variant of Muon that clips per-sample matrix gradients, adds Gaussian noise to their batch average, and orthogonalizes the resulting momentum.
DP-Muon adapts the Gaussian mechanism to Muon's matrix-aware update. For each weight matrix \(W\), it clips every per-sample gradient in Frobenius norm to a threshold \(C_W\), averages the clipped gradients over the batch, and adds Gaussian noise calibrated to \(C_W\), giving a differentially private stochastic gradient. This noisy gradient feeds a heavy-ball momentum buffer. The momentum matrix is then oriented to have no more rows than columns and orthogonalized by a Newton–Schulz iteration, so that the applied step is approximately orthogonal, with singular values close to 1.
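The update equations themselves are not reproduced on this page. A reconstruction consistent with the description above is sketched here. The auxiliary names \(\mathcal{B}_t\) (the batch), \(\xi_t\), \(\tilde g_t\), \(M_t\), and \(X_j\), the placement of the \(1/B\) factor, the unweighted heavy-ball form, and the Frobenius normalization before the iteration are assumptions, not details confirmed by the source.

\[
\begin{aligned}
\tilde g_t &= \frac{1}{B}\Big(\sum_{z\in\mathcal{B}_t} g_t(z)\,\min\!\Big(1,\frac{C_W}{\lVert g_t(z)\rVert_F}\Big) + \sigma C_W\,\xi_t\Big), \qquad \xi_t \sim \mathcal{N}(0, I),\\
M_t &= \beta M_{t-1} + \tilde g_t,\\
X_0 &= \frac{\mathcal{T}(M_t)}{\lVert \mathcal{T}(M_t)\rVert_F}, \qquad X_{j+1} = p_\kappa\!\big(X_j X_j^\top\big)\,X_j \quad (j = 0,\dots,q-1),\\
O_t &= \mathcal{T}^{-1}(X_q),\\
W_{t+1} &= W_t - \eta\,O_t.
\end{aligned}
\]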
Here \(g_t(z)\) is the per-sample gradient of the loss \(\ell\), \(C_W\) is the Frobenius clipping threshold, \(B\) the batch size, \(\sigma\) the noise multiplier, \(\beta\) the momentum decay, and \(\eta\) the learning rate. The operator \(\mathcal{T}\) transposes \(W\) iff it has more rows than columns (so \(\mathcal{T}(W)\) has no more rows than columns), with inverse \(\mathcal{T}^{-1}\). The degree-\(\kappa\) Newton–Schulz polynomial \(p_\kappa(\lambda)=\sum_{s=0}^{\kappa}\frac{(2s)!}{4^s (s!)^2}(1-\lambda)^s\), the Taylor truncation of \(\lambda^{-1/2}\) about \(\lambda=1\), is applied for \(q\) iterations, and \(O_t\) is the orthogonalized update.
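To make these symbols concrete, the following is a minimal NumPy sketch of one DP-Muon step for a single weight matrix. It is an illustration under stated assumptions, not the reference implementation: the function names (`ns_coeffs`, `orthogonalize`, `dp_muon_step`), the default hyperparameters, the Frobenius normalization before the iteration, and the unweighted heavy-ball update are all hypothetical choices made for this example.

```python
import numpy as np
from math import comb


def ns_coeffs(kappa):
    """Coefficients (2s)! / (4^s (s!)^2) = C(2s, s) / 4^s for s = 0..kappa."""
    return [comb(2 * s, s) / 4.0 ** s for s in range(kappa + 1)]


def orthogonalize(m, kappa=1, q=5):
    """Apply q Newton-Schulz iterations X <- p_kappa(X X^T) X to matrix m."""
    # Orient the matrix so it has no more rows than columns (the operator T).
    transposed = m.shape[0] > m.shape[1]
    x = m.T if transposed else m
    # Assumption: scale to unit Frobenius norm so all singular values are <= 1.
    x = x / (np.linalg.norm(x) + 1e-12)
    eye = np.eye(x.shape[0])
    coeffs = ns_coeffs(kappa)
    for _ in range(q):
        r = eye - x @ x.T
        # Horner evaluation of p_kappa(X X^T) = sum_s c_s (I - X X^T)^s.
        p = coeffs[-1] * eye
        for c in reversed(coeffs[:-1]):
            p = p @ r + c * eye
        x = p @ x
    # Undo the orientation (the operator T^{-1}).
    return x.T if transposed else x


def dp_muon_step(w, momentum, per_sample_grads, rng, clip=1.0, sigma=1.0,
                 beta=0.9, lr=0.02, kappa=1, q=5):
    """One sketched DP-Muon update; per_sample_grads has shape (B, *w.shape)."""
    batch = per_sample_grads.shape[0]
    # Frobenius norm of each per-sample gradient matrix.
    norms = np.linalg.norm(per_sample_grads, axis=(1, 2))
    # Scale each gradient so its Frobenius norm is at most `clip`.
    scale = np.minimum(1.0, clip / (norms + 1e-12))
    clipped_sum = (per_sample_grads * scale[:, None, None]).sum(axis=0)
    # Gaussian noise with standard deviation sigma * clip, added before averaging.
    noise = sigma * clip * rng.standard_normal(w.shape)
    noisy_grad = (clipped_sum + noise) / batch
    # Heavy-ball momentum (assumed unweighted form).
    momentum = beta * momentum + noisy_grad
    update = orthogonalize(momentum, kappa, q)
    return w - lr * update, momentum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.zeros((4, 2))
    m = np.zeros_like(w)
    grads = rng.standard_normal((8, 4, 2))  # stand-in per-sample gradients
    w, m = dp_muon_step(w, m, grads, rng)
    print(w.shape, m.shape)
```

With \(\kappa = 1\) the polynomial reduces to \(p_1(\lambda) = \tfrac{3}{2} - \tfrac{1}{2}\lambda\), so each iteration is the classical Newton–Schulz update \(X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^\top X\). The sketch does not perform privacy accounting; the resulting \((\varepsilon, \delta)\) guarantee depends on \(\sigma\), the sampling scheme, and the number of steps.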
Reference: Jihwan Kim, Chenglin Fan, "DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum", arXiv 2026. https://arxiv.org/abs/2605.12994