ROOT
Implements ROOT, a robust orthogonalized optimizer that orthogonalizes the outlier-suppressed momentum.
ROOT builds on the matrix-orthogonalization idea of Muon but addresses two robustness gaps. First, instead of a fixed Newton–Schulz iteration, it uses an adaptive quintic iteration whose coefficients \(a^{(m,n)}, b^{(m,n)}, c^{(m,n)}\) are tuned to the parameter shape \(m \times n\), giving a more accurate orthogonalization across layers of differing dimensions. Second, before orthogonalizing, it applies element-wise soft-thresholding to the momentum to separate heavy-tailed outliers \(O_t\) from a clipped robust component \(B_t\); only \(B_t\) is orthogonalized and used to update the weights, so a few large entries cannot dominate the resulting search direction.
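The update that these two steps produce can be sketched as follows. This is a reconstruction from the description and notation on this page, not a verbatim copy of the paper's equations. In particular, the Muon-style momentum accumulation \(M_t = \mu M_{t-1} + g_t\) and the explicit form of \(\mathcal{T}_\epsilon\) are assumptions.

\[
\begin{aligned}
M_t &= \mu M_{t-1} + g_t \\
O_t &= \mathcal{T}_\epsilon(M_t), \qquad \mathcal{T}_\epsilon(x) = \operatorname{sign}(x)\,\max(\lvert x \rvert - \epsilon,\ 0) \\
B_t &= M_t - O_t \\
X_0 &= \frac{B_t}{\lVert B_t \rVert_F} \\
X_k &= a^{(m,n)} X_{k-1} + b^{(m,n)} \left(X_{k-1} X_{k-1}^\top\right) X_{k-1} + c^{(m,n)} \left(X_{k-1} X_{k-1}^\top\right)^2 X_{k-1}, \qquad k = 1, \dots, K \\
\theta_t &= \theta_{t-1} - \eta\, X_K
\end{aligned}
\]

Under this form, \(B_t\) equals \(M_t\) with every entry clipped to \([-\epsilon, \epsilon]\), which is why a few large entries cannot dominate the direction.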
Notation: \(\theta\) denotes the parameters, \(\eta\) the learning rate, \(g_t\) the gradient, \(M_t\) the momentum buffer, and \(\mu\) its decay. \(\epsilon\) is the soft threshold and \(\mathcal{T}_\epsilon\) the element-wise soft-thresholding (outlier) operator, which splits the momentum into the suppressed outliers \(O_t\) and the robust clipped component \(B_t\). \(\lVert\cdot\rVert_F\) is the Frobenius norm, and \(X_K\) is the adaptive Newton–Schulz iterate after \(K\) steps, that is, the orthogonalized direction \(B_t^{\mathrm{orth}}\), computed with shape-dependent coefficients \(a^{(m,n)}, b^{(m,n)}, c^{(m,n)}\) for a parameter matrix of size \(m \times n\).
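To make the notation concrete, here is a minimal NumPy sketch of one step for a single 2-D parameter. It is an illustration under stated assumptions, not this library's implementation: the function names (`soft_threshold`, `split_outliers`, `orthogonalize`, `root_step`) are hypothetical, the momentum is accumulated as \(\mu M_{t-1} + g_t\), the defaults for `lr`, `mu`, `eps`, and `steps` are arbitrary, and the quintic coefficients are the fixed values commonly used by Muon, standing in for ROOT's shape-dependent \(a^{(m,n)}, b^{(m,n)}, c^{(m,n)}\), which this page does not list.

```python
import numpy as np

# Placeholder coefficients: the fixed quintic coefficients commonly used by
# Muon, NOT the shape-dependent a^(m,n), b^(m,n), c^(m,n) that ROOT tunes.
MUON_COEFFS = (3.4445, -4.7750, 2.0315)


def soft_threshold(m, eps):
    """Element-wise soft-thresholding: sign(m) * max(|m| - eps, 0)."""
    return np.sign(m) * np.maximum(np.abs(m) - eps, 0.0)


def split_outliers(m, eps):
    """Split momentum into outliers O and the robust component B = M - O."""
    outliers = soft_threshold(m, eps)
    robust = m - outliers  # same as clipping every entry of m to [-eps, eps]
    return outliers, robust


def orthogonalize(b, coeffs=MUON_COEFFS, steps=5, tiny=1e-7):
    """Approximately orthogonalize b with a quintic Newton-Schulz iteration."""
    a, b2, c = coeffs
    x = b / (np.linalg.norm(b) + tiny)  # Frobenius normalization gives X_0
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T  # work on the wide orientation so the Gram matrix is smaller
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b2 * gram + c * (gram @ gram)) @ x
    return x.T if tall else x


def root_step(theta, grad, momentum, lr=0.02, mu=0.95, eps=1.0):
    """One sketched update; returns the new parameters and momentum buffer."""
    momentum = mu * momentum + grad              # assumed Muon-style momentum
    _, robust = split_outliers(momentum, eps)    # discard the outliers O_t
    theta = theta - lr * orthogonalize(robust)   # step along X_K
    return theta, momentum
```

With these placeholder coefficients the singular values of the result land near 1 rather than exactly at 1, so the output is only approximately orthogonal. The shape-dependent coefficients described above are meant to tighten that approximation for each layer shape.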
Reference: Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, Yunhe Wang, "ROOT: Robust Orthogonalized Optimizer for Neural Network Training", 2025. https://arxiv.org/abs/2511.20626