FISMO
Implements FISMO (Fisher-Structured Momentum-Orthogonalized optimizer), a Kronecker-factored second-order method that orthogonalizes momentum in a whitened geometry.
FISMO models the per-layer Fisher information as a Kronecker product of a left factor \(P_t \in \mathbb{R}^{m\times m}\) and a right factor \(Q_t \in \mathbb{R}^{n\times n}\), each maintained by an exponential moving average and trace-normalized to control scale. The raw gradient is whitened by these factors before momentum is accumulated, and the momentum is then orthogonalized with the matrix polar factor (computed via Newton-Schulz iterations, \(\mathrm{Polar}(M)=UV^\top\) for an SVD \(M=U\Sigma V^\top\)). The orthogonalized step is mapped back through the preconditioners, combining the curvature awareness of Fisher methods with the spectral conditioning of Muon-style updates.
Notation: \(W_t\in\mathbb{R}^{m\times n}\) denotes the matrix-shaped parameters, \(G_t\) the minibatch gradient, \(\eta\) the learning rate, \(\beta\) the momentum coefficient, \(\gamma\) the EMA decay of the Fisher factors, \(\mu\) the damping factor, \(\mathrm{sym}(A)=\tfrac12(A+A^\top)\) the symmetrization operator, and \(\mathrm{Polar}(M)=UV^\top\) the orthogonal polar factor of \(M\).
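To make the orthogonalization step concrete, the following is a minimal, dependency-free sketch of approximating \(\mathrm{Polar}(M)\) with Newton-Schulz iterations. It assumes the classical cubic iteration \(X \leftarrow \tfrac32 X - \tfrac12 XX^\top X\) with Frobenius-norm scaling; the coefficients, iteration count, and normalization used by the actual implementation may differ. The helper names `matmul`, `transpose`, and `newton_schulz_polar` are hypothetical and not part of any library API.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def newton_schulz_polar(M, num_iters=20):
    """Approximate the polar factor U V^T of a full-rank matrix M.

    Assumption: classical cubic Newton-Schulz iteration; this is an
    illustrative sketch, not the library's implementation.
    """
    # Scale by the Frobenius norm so that every singular value is at most 1,
    # which lies inside the convergence region (0, sqrt(3)) of the iteration.
    fro = sum(x * x for row in M for x in row) ** 0.5
    if fro == 0.0:
        return [row[:] for row in M]
    X = [[x / fro for x in row] for row in M]
    for _ in range(num_iters):
        # X <- 1.5 * X - 0.5 * (X X^T) X pushes each singular value towards 1
        # while leaving the singular vectors unchanged.
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X
```

For a full-rank \(m\times n\) input with \(m\le n\), the result has approximately orthonormal rows. Singular values close to zero converge slowly under this iteration, so the number of iterations needed depends on the conditioning of \(M\).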
Reference: Chenrui Xu, Wenjing Yan, Ying-Jun Angela Zhang, "FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer", ICML 2026. https://arxiv.org/abs/2601.21750