DeMuon¶
Implements DeMuon, a decentralized Muon for matrix optimization over networks.
DeMuon runs Muon across \(N\) workers connected by a communication graph, with no central server. Each worker \(i\) keeps a local momentum estimate \(M_i\), a gradient-tracking variable \(V_i\) that follows the network-wide gradient, and a local iterate \(X_i\). Mixing with neighbors through the doubly stochastic matrix \(W\) drives the iterates toward consensus, while the matrix-sign (orthogonalization) step is applied to the tracked gradient so the structure-aware Muon update is preserved in the distributed setting.
where \(X_i \in \mathbb{R}^{m\times n}\) is worker \(i\)'s parameter matrix, \(g_i^k\) its stochastic gradient, \(\theta \in (0,1)\) the momentum weight, \(\eta\) the step size, \(w_{ij}\) the entries of the mixing matrix \(W\) (nonzero only for neighbors), and \(\mathrm{msgn}(M) = UV^\top\) from the reduced SVD \(M = U\Sigma V^\top\) is the matrix sign that orthogonalizes the tracked gradient.
Reference: Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson, "DeMuon: A Decentralized Muon for Matrix Optimization over Graphs", arXiv 2025. https://arxiv.org/abs/2510.01377