ZO-Muon¶
Implements ZO-Muon, a zeroth-order optimizer that orthogonalizes a subspace gradient estimate before the step.
ZO-Muon brings Muon-style orthogonalization to derivative-free training. At each step it draws random perturbation matrices in a low-dimensional subspace, lifts them through a column-orthonormal projection \(P\), and forms a randomized finite-difference estimate of the gradient from forward function evaluations only. The matrix sign function (computed by Newton-Schulz iteration) then whitens this estimate by equalizing its active singular directions, and the result is projected back into parameter space for the update.
where \(\theta_t\) are the parameters, \(\gamma_t\) is the learning rate, \(P\) is a column-orthonormal projection matrix (resampled periodically via QR of a random Gaussian matrix), \(\Psi_i\) are random Gaussian perturbations sampled in the subspace, \(N_q\) is the number of queries per step, \(\mu > 0\) is the smoothing radius, and \(\mathrm{msign}(\cdot)\) is the matrix sign function approximated by a Newton-Schulz iteration.
Reference: Lang, Wang, Zhang, Hong, Zhang, Yin, and Liu, "Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization", arXiv preprint 2026. https://arxiv.org/abs/2602.17155