MuonClip
Implements MuonClip, the Muon optimizer augmented with QK-Clip to bound attention logits during large-scale training.
MuonClip keeps the standard Muon update (momentum on the gradient, orthogonalized by a Newton–Schulz iteration and rescaled so its RMS matches that of an Adam update) and adds a post-update rescaling step called QK-Clip. After each optimizer step, any attention head whose maximum logit \(S^h_{\max}\) exceeds a threshold \(\tau\) has its query/key projection weights shrunk by a per-head factor \(\gamma_h\). This caps logit growth without changing the forward or backward pass of the current step. For the MLA layout, the non-rotary query and key components \(q^C, k^C\) are each scaled by \(\sqrt{\gamma_h}\) and the per-head rotary query \(q^R\) by \(\gamma_h\), so both terms of the logit, \(q^C \cdot k^C\) and \(q^R \cdot k^R\), shrink by exactly \(\gamma_h\). The shared rotary key \(k^R\) is left untouched so that clipping one head does not affect the others.
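For orientation, the update described above can be written out as follows. This is a sketch that follows Algorithm 1 of the referenced Kimi K2 paper, rewritten in this page's symbols. The intermediate \(O_t\) and the constant \(0.2\) in the RMS-matching factor come from the paper and are assumptions about this implementation, which may differ in details such as the momentum variant.

\[
\begin{aligned}
M_t &= \mu M_{t-1} + g_t \\
O_t &= \mathrm{NewtonSchulz}(M_t) \cdot 0.2\sqrt{\max(n, m)} \\
\theta_t &= \theta_{t-1} - \eta \left( O_t + \lambda\, \theta_{t-1} \right) \\
S^h_{\max} &= \frac{1}{\sqrt{d}} \max_{X \in B} \max_{i,j} Q^h_i \left( K^h_j \right)^{\top} \\
\gamma_h &= \min\left( 1, \frac{\tau}{S^h_{\max}} \right)
\end{aligned}
\]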
In the update rule, \(\theta\) is the parameter being updated (a weight matrix of shape \(n\times m\)), \(\eta\) the learning rate, \(g_t\) the gradient, \(M_t\) the momentum buffer, \(\mu\) the momentum coefficient, \(\lambda\) the weight decay, and \(\mathrm{NewtonSchulz}(\cdot)\) the orthogonalization iteration. In QK-Clip, \(S^h_{\max}\) is the largest attention logit for head \(h\) over the batch \(B\) (with \(Q^h_i, K^h_j\) the query and key for tokens \(i, j\) and \(d\) the head dimension), \(\tau\) is the logit threshold, and \(\gamma_h\) is the resulting per-head clip factor applied to the query/key components \(q^C, k^C, q^R\).
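The effect of the per-head factor can be checked on a toy example. The sketch below is illustrative only and is not this optimizer's code: the helper names (`qk_clip_factor`, `mla_scales`, `logit`, `scaled`) are hypothetical, it assumes \(\gamma_h = \min(1, \tau / S^h_{\max})\) as in the referenced paper, and it scales the query/key vectors directly as a stand-in for scaling the projection weights (equivalent here because the projections are linear).

```python
import math

def qk_clip_factor(s_max, tau):
    """Per-head clip factor gamma_h = min(1, tau / s_max), for tau > 0."""
    if s_max <= tau:
        return 1.0  # head is under the threshold: leave it alone
    return tau / s_max

def mla_scales(gamma):
    """Multipliers for the MLA components of one head (hypothetical helper)."""
    root = math.sqrt(gamma)
    # q^C and k^C share the factor; q^R takes all of it; shared k^R is untouched.
    return {"q_c": root, "k_c": root, "q_r": gamma, "k_r": 1.0}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logit(q_c, k_c, q_r, k_r, d):
    """Scaled attention logit for one query/key pair."""
    return (dot(q_c, k_c) + dot(q_r, k_r)) / math.sqrt(d)

def scaled(vec, factor):
    return [factor * x for x in vec]

# Toy head: 2 non-rotary + 2 rotary dimensions, so d = 4.
q_c, k_c = [3.0, 4.0], [3.0, 4.0]
q_r, k_r = [2.0, 0.0], [5.0, 0.0]
d, tau = 4, 10.0

before = logit(q_c, k_c, q_r, k_r, d)   # (25 + 10) / sqrt(4) = 17.5
gamma = qk_clip_factor(before, tau)     # 10 / 17.5
s = mla_scales(gamma)
after = logit(scaled(q_c, s["q_c"]), scaled(k_c, s["k_c"]),
              scaled(q_r, s["q_r"]), scaled(k_r, s["k_r"]), d)
print(before, gamma, after)
```

Because both logit terms pick up the same factor \(\gamma_h\), `after` equals `tau` up to floating-point rounding, while a head already below the threshold gets a factor of 1 and is unchanged.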
Reference: Kimi Team, "Kimi K2: Open Agentic Intelligence", arXiv 2025. https://arxiv.org/abs/2507.20534