Second-Order and Orthogonalized Optimizers

Second-order and orthogonalized optimizers exploit curvature information or the matrix structure of gradients rather than purely elementwise first-order statistics. This group spans quasi-Newton and Hessian-diagonal methods (L-BFGS, AdaHessian, Sophia), full-matrix and Kronecker-factored preconditioning (PSGD, Shampoo, SOAP), and orthogonalized-update methods in the Muon family. Venues reflect peer-reviewed acceptance where applicable; otherwise the arXiv year is listed.

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| Gauss-Newton Method | Biometrika 1974 | Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method | | |
| Newton's Method | ANL Technical Report 1982 | Newton's method (ANL-82-8) | | |
| L-BFGS | Mathematical Programming 1989 | On the limited memory BFGS method for large scale optimization | official | LBFGS |
| Natural Gradient | Neural Computation 1998 | Natural Gradient Works Efficiently in Learning | | |
| K-FAC | ICML 2015 | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | | |
| PSGD | IEEE TNNLS 2018 | Preconditioned Stochastic Gradient Descent | official | |
| Shampoo | ICML 2018 | Shampoo: Preconditioned Stochastic Tensor Optimization | official | Shampoo |
| AdaHessian | AAAI 2021 | ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | official | Adahessian |
| Apollo | arXiv 2020 | Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | official | |
| K-BFGS / K-BFGS(L) | NeurIPS 2020 | Practical Quasi-Newton Methods for Training Deep Neural Networks | | |
| SGN | arXiv 2020 | On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs | | |
| SpiderSQN | IEEE TNNLS 2022 | Faster Stochastic Quasi-Newton Methods | | |
| TKFAC | AAAI 2021 | A Trace-restricted Kronecker-Factored Approximation to Natural Gradient | | |
| SGDHess | NeurIPS 2022 | Better SGD using Second-order Momentum | | |
| SketchySGD | SIMODS 2024 | SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates | official | |
| Distributed Shampoo | arXiv 2023 | A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale | official | |
| mL-BFGS | TMLR 2023 | mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization | | |
| Sophia | ICLR 2024 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | official | SophiaG |
| AdaFisher | ICLR 2025 | AdaFisher: Adaptive Second Order Optimization via Fisher Information | official | |
| CRNAS | arXiv 2024 | Novel Optimization Techniques for Parameter Estimation | | |
| HesScale | ICML 2024 | Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | official | |
| Muon | Blog post 2024 | Muon: An optimizer for hidden layers in neural networks | official | Muon |
| NysAct | IEEE BigData 2024 | NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation | | |
| OptiQ | arXiv 2024 | Second-Order Optimization via Quiescence | | |
| Q-Newton | arXiv 2024 | Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent | official | |
| SOAA | arXiv 2024 | Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods | | |
| SOAP | ICLR 2025 | SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling | official | SOAP |
| AdaDiag | arXiv 2025 | Improving Adaptive Moment Optimization via Preconditioner Diagonalization | | |
| ADAGB2 | arXiv 2025 | Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization | | |
| AdaGO | arXiv 2025 | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | | |
| AdaMuon | arXiv 2025 | AdaMuon: Adaptive Muon Optimizer | official | AdaMuon |
| ASGO | NeurIPS 2025 | ASGO: Adaptive Structured Gradient Optimization | official | |
| AuON | arXiv 2025 | AuON: A Linear-time Alternative to Orthogonal Momentum Updates | official | |
| COSMOS | arXiv 2025 | COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs | official | |
| FUSE | IEEE CAI 2025 | FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization | | |
| Hessian-aware Scaling | arXiv 2025 | First-ish Order Methods: Hessian-aware Scalings of Gradient Descent | | |
| MAC | IEEE ICDM 2025 | MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature | | |
| MuonClip | arXiv 2025 | Kimi K2: Open Agentic Intelligence | community | |
| NorMuon | ICML 2026 | NorMuon: Making Muon more efficient and scalable | official | NorMuon |
| OCAR | ICML 2025 | Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning | | |
| PolarGrad | arXiv 2025 | PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective | official | PolarGrad |
| ROOT | arXiv 2025 | ROOT: Robust Orthogonalized Optimizer for Neural Network Training | official | |
| S-BFGS | arXiv 2025 | Efficient Stochastic BFGS methods Inspired by Bayesian Principles | | |
| SASSHA | ICML 2025 | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | official | |
| Scion | ICML 2025 | Training Deep Learning Models with Norm-Constrained LMOs | official | Scion |
| SPlus | arXiv 2025 | A Stable Whitening Optimizer for Efficient Neural Network Training | official | SPlus |
| Muon^2 | arXiv 2026 | Muon^2: Boosting Muon via Adaptive Second-Moment Preconditioning | | |
| Nora | arXiv 2026 | Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer | | |
| Pion | arXiv 2026 | Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR | | |
| Spectral Sphere Optimizer (SSO) | arXiv 2026 | Controlled LLM Training on Spectral Sphere | official | |
| LoRA-Muon | arXiv 2026 | LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold | | |
| FOAM | arXiv 2026 | FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo | | |
| Mousse | arXiv 2026 | Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning | official | |
| FISMO | arXiv 2026 | FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer | | |
| DyKAF | arXiv 2025 | DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning | | |
| Double Preconditioning (DoPr) | arXiv 2026 | Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss | | |
| AdaCubic | TMLR 2026 | AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning | official | |
| IFNSO | arXiv 2026 | IFNSO: Iteration-Free Newton-Schulz Orthogonalization | official | |
| CAO | arXiv 2025 | CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching | | |
| Turbo-Muon | arXiv 2025 | Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning | official | |
| SR1 Cubic Quasi-Newton | arXiv 2025 | Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization | | |
| KL-Shampoo | ICLR 2026 | Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization | official | |
| LLQR | arXiv 2026 | Layerwise LQR for Geometry-Aware Optimization of Deep Networks | official | |
| Freon / Kaon | arXiv 2026 | Muon is Not That Special: Random or Inverted Spectra Work Just as Well | | |
| Mano | arXiv 2026 | Mano: Restriking Manifold Optimization for LLM Training | official | |
| Atlas | OPT 2025: 17th Annual Workshop on Optimization for Machine Learning (co-located with NeurIPS 2025) | Atlas – Rethinking Optimizer Design for Stability and Speed | | |
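
To make the "orthogonalized-update" idea behind the Muon family concrete, here is a minimal sketch in plain Python. It replaces a gradient matrix with an approximation of its orthogonal polar factor (the `U V^T` of its SVD) using the classical cubic Newton-Schulz iteration. This is an illustration under stated assumptions, not the implementation from any paper listed above: the function names, the step count, and the toy gradient are all hypothetical, and the published methods may use differently tuned iterations and add momentum, scaling, and other details.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of g.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X
    (an assumption for illustration, not a tuned variant).
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x


# Hypothetical 3x2 "gradient" with full column rank.
grad = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]
update = newton_schulz_orthogonalize(grad)

# For a full-column-rank input, update^T update should be close to the identity.
gram = matmul(transpose(update), update)
```

The resulting `update` keeps the singular directions of `grad` but pushes all of its singular values toward 1, which is the sense in which the update is "orthogonalized". An elementwise first-order method would instead rescale each entry independently.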