| Gauss-Newton Method | Biometrika 1974 | Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method | — | — |
| Newton's Method | ANL Technical Report 1982 | Newton's method (ANL-82-8) | — | — |
| L-BFGS | Mathematical Programming 1989 | On the limited memory BFGS method for large scale optimization | official | LBFGS |
| Natural Gradient | Neural Computation 1998 | Natural Gradient Works Efficiently in Learning | — | — |
| K-FAC | ICML 2015 | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | — | — |
| PSGD | IEEE TNNLS 2018 | Preconditioned Stochastic Gradient Descent | official | — |
| Shampoo | ICML 2018 | Shampoo: Preconditioned Stochastic Tensor Optimization | official | Shampoo |
| AdaHessian | AAAI 2021 | ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | official | Adahessian |
| Apollo | arXiv 2020 | Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | official | — |
| K-BFGS / K-BFGS(L) | NeurIPS 2020 | Practical Quasi-Newton Methods for Training Deep Neural Networks | — | — |
| SGN | arXiv 2020 | On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs | — | — |
| SpiderSQN | IEEE TNNLS 2022 | Faster Stochastic Quasi-Newton Methods | — | — |
| TKFAC | AAAI 2021 | A Trace-restricted Kronecker-Factored Approximation to Natural Gradient | — | — |
| SGDHess | NeurIPS 2022 | Better SGD using Second-order Momentum | — | — |
| SketchySGD | SIMODS 2024 | SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates | official | — |
| Distributed Shampoo | arXiv 2023 | A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale | official | — |
| mL-BFGS | TMLR 2023 | mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization | — | — |
| Sophia | ICLR 2024 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | official | SophiaG |
| AdaFisher | ICLR 2025 | AdaFisher: Adaptive Second Order Optimization via Fisher Information | official | — |
| CRNAS | arXiv 2024 | Novel Optimization Techniques for Parameter Estimation | — | — |
| HesScale | ICML 2024 | Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | official | — |
| Muon | Blog post 2024 | Muon: An optimizer for hidden layers in neural networks | official | Muon |
| NysAct | IEEE BigData 2024 | NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation | — | — |
| OptiQ | arXiv 2024 | Second-Order Optimization via Quiescence | — | — |
| Q-Newton | arXiv 2024 | Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent | official | — |
| SOAA | arXiv 2024 | Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods | — | — |
| SOAP | ICLR 2025 | SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling | official | SOAP |
| AdaDiag | arXiv 2025 | Improving Adaptive Moment Optimization via Preconditioner Diagonalization | — | — |
| ADAGB2 | arXiv 2025 | Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization | — | — |
| AdaGO | arXiv 2025 | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | — | — |
| AdaMuon | arXiv 2025 | AdaMuon: Adaptive Muon Optimizer | official | AdaMuon |
| ASGO | NeurIPS 2025 | ASGO: Adaptive Structured Gradient Optimization | official | — |
| AuON | arXiv 2025 | AuON: A Linear-time Alternative to Orthogonal Momentum Updates | official | — |
| COSMOS | arXiv 2025 | COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs | official | — |
| FUSE | IEEE CAI 2025 | FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization | — | — |
| Hessian-aware Scaling | arXiv 2025 | First-ish Order Methods: Hessian-aware Scalings of Gradient Descent | — | — |
| MAC | IEEE ICDM 2025 | MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature | — | — |
| MuonClip | arXiv 2025 | Kimi K2: Open Agentic Intelligence | community | — |
| NorMuon | ICML 2026 | NorMuon: Making Muon more efficient and scalable | official | NorMuon |
| OCAR | ICML 2025 | Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning | — | — |
| PolarGrad | arXiv 2025 | PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective | official | PolarGrad |
| ROOT | arXiv 2025 | ROOT: Robust Orthogonalized Optimizer for Neural Network Training | official | — |
| S-BFGS | arXiv 2025 | Efficient Stochastic BFGS methods Inspired by Bayesian Principles | — | — |
| SASSHA | ICML 2025 | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | official | — |
| Scion | ICML 2025 | Training Deep Learning Models with Norm-Constrained LMOs | official | Scion |
| SPlus | arXiv 2025 | A Stable Whitening Optimizer for Efficient Neural Network Training | official | SPlus |
| Muon^2 | arXiv 2026 | Muon^2: Boosting Muon via Adaptive Second-Moment Preconditioning | — | — |
| Nora | arXiv 2026 | Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer | — | — |
| Pion | arXiv 2026 | Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR | — | — |
| Spectral Sphere Optimizer (SSO) | arXiv 2026 | Controlled LLM Training on Spectral Sphere | official | — |
| LoRA-Muon | arXiv 2026 | LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold | — | — |
| FOAM | arXiv 2026 | FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo | — | — |
| Mousse | arXiv 2026 | Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning | official | — |
| FISMO | arXiv 2026 | FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer | — | — |
| DyKAF | arXiv 2025 | DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning | — | — |
| Double Preconditioning (DoPr) | arXiv 2026 | Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss | — | — |
| AdaCubic | TMLR 2026 | AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning | official | — |
| IFNSO | arXiv 2026 | IFNSO: Iteration-Free Newton-Schulz Orthogonalization | official | — |
| CAO | arXiv 2025 | CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching | — | — |
| Turbo-Muon | arXiv 2025 | Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning | official | — |
| SR1 Cubic Quasi-Newton | arXiv 2025 | Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization | — | — |
| KL-Shampoo | ICLR 2026 | Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization | official | — |
| LLQR | arXiv 2026 | Layerwise LQR for Geometry-Aware Optimization of Deep Networks | official | — |
| Freon / Kaon | arXiv 2026 | Muon is Not That Special: Random or Inverted Spectra Work Just as Well | — | — |
| Mano | arXiv 2026 | Mano: Restriking Manifold Optimization for LLM Training | official | — |
| Atlas | OPT 2025: 17th Annual Workshop on Optimization for Machine Learning (co-located with NeurIPS 2025) | Atlas – Rethinking Optimizer Design for Stability and Speed | — | — |