| ASGD | SIAM Journal on Control and Optimization 1992 | Acceleration of Stochastic Approximation by Averaging | community | ASGD |
| Rprop | ICNN 1993 | A direct adaptive method for faster backpropagation learning: the RPROP algorithm | community | Rprop |
| Adagrad | JMLR 2011 | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | community | Adagrad |
| Adadelta | arXiv 2012 | ADADELTA: An Adaptive Learning Rate Method | community | Adadelta |
| RMSprop | Lecture notes 2012 | Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude | community | RMSprop |
| FTRL | KDD 2013 | Ad Click Prediction: a View from the Trenches | — | — |
| SGD | ICML 2013 | On the importance of initialization and momentum in deep learning | community | SGD |
| Adam | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adam |
| AdaMax | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adamax |
| Nadam | ICLR Workshop 2016 | Incorporating Nesterov Momentum into Adam | community | NAdam |
| LARS | arXiv 2017 | Large Batch Training of Convolutional Networks | community | LARS |
| SWATS | arXiv 2017 | Improving Generalization Performance by Switching from Adam to SGD | community | SWATS |
| A2Grad | arXiv 2018 | Optimal Adaptive and Accelerated Stochastic Gradient Descent | community | A2GradUni, A2GradInc, A2GradExp |
| AccSGD | ICLR 2018 | On the insufficiency of existing momentum schemes for Stochastic Optimization | official | AccSGD |
| AMSGrad | ICLR 2018 | On the Convergence of Adam and Beyond | community | — |
| GADAM | arXiv 2018 | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | — | — |
| M-SVAG | ICML 2018 | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | official | — |
| PID | CVPR 2018 | A PID Controller Approach for Stochastic Optimization of Deep Networks | official | PID |
| VR-SGD | IEEE TKDE 2018 | VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | — | — |
| Yogi | NeurIPS 2018 | Adaptive Methods for Nonconvex Optimization | community | Yogi |
| AdaBound | ICLR 2019 | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | official | AdaBound, AdaBoundW |
| AdaMod | arXiv 2019 | An Adaptive and Momental Bound Method for Stochastic Learning | official | AdaMod |
| AdamW | ICLR 2019 | Decoupled Weight Decay Regularization | official | AdamW |
| AdaShift | ICLR 2019 | AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | community | AdaShift |
| AggMo | ICLR 2019 | Aggregated Momentum: Stability Through Passive Damping | official | AggMo |
| AvaGrad | arXiv 2019 | Domain-independent Dominance of Adaptive Methods | official | AvaGrad |
| HAdam | NeurIPS Workshop 2019 | On Higher-order Moments in Adam | — | — |
| HyperAdam | AAAI 2019 | HyperAdam: A Learnable Task-Adaptive Adam for Network Training | — | — |
| Lookahead | NeurIPS 2019 | Lookahead Optimizer: k steps forward, 1 step back | community | Lookahead |
| NosAdam | IJCAI 2019 | Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | — | — |
| NovoGrad | arXiv 2019 | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | community | NovoGrad |
| QHAdam / QHM | ICLR 2019 | Quasi-hyperbolic momentum and Adam for deep learning | official | QHAdam, QHM |
| Ranger | — | RAdam and Lookahead combination | official | Ranger |
| Sadam | arXiv 2019 | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | — | — |
| AdaBelief | NeurIPS 2020 | AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | official | AdaBelief |
| Adam+ | arXiv 2020 | Adam+: A Stochastic Method with Adaptive Variance Reduction | — | — |
| AdamBS | NeurIPS 2020 | Adam with Bandit Sampling for Deep Learning | — | — |
| AdaSGD | arXiv 2020 | AdaSGD: Bridging the gap between SGD and Adam | — | — |
| Cayley SGD | ICLR 2020 | Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform | official | — |
| clipped-SGD | NeurIPS 2020 | Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping | official | — |
| DEAM | ASONAM 2020 | DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization | — | — |
| diffGrad | IEEE TNNLS 2020 | diffGrad: An Optimization Method for Convolutional Neural Networks | official | DiffGrad |
| EAdam | arXiv 2020 | EAdam Optimizer: How ε Impact Adam | official | — |
| Fromage | NeurIPS 2020 | On the distance between two neural networks and the stability of learning | official | — |
| Gradient Centralization (GC) | ECCV 2020 | Gradient Centralization: A New Optimization Technique for Deep Neural Networks | official | — |
| LAMB | ICLR 2020 | Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | community | Lamb |
| LaProp | arXiv 2020 | LaProp: Separating Momentum and Adaptivity in Adam | official | LaProp |
| NIGT | ICML 2020 | Momentum Improves Normalized SGD | official | — |
| Padam | IJCAI 2020 | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | official | PAdam |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | community | SignSGD |
| pbSGD | IJCAI 2020 | pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | official | — |
| PCGrad | NeurIPS 2020 | Gradient Surgery for Multi-Task Learning | official | — |
| RAdam | ICLR 2020 | On the Variance of the Adaptive Learning Rate and Beyond | official | RAdam |
| SGD-G2 | ICPR 2020 | Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | — | — |
| ACMo | AAAI 2021 | ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization | — | — |
| ACProp | NeurIPS 2021 | Momentum Centering and Asynchronous Update for Adaptive Gradient Methods | official | — |
| AdaL | arXiv 2021 | AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations | — | — |
| AdamD | arXiv 2021 | AdamD: Improved bias-correction in Adam | — | — |
| AdamP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | AdamP |
| Adaptive Gradient Clipping (AGC) | ICML 2021 | High-Performance Large-Scale Image Recognition Without Normalization | official | — |
| AngularGrad | arXiv 2021 | AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks | official | — |
| BGADAM | IJCNN 2021 | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | — | — |
| Gravity | arXiv 2021 | Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning | official | Gravity |
| MADGRAD | arXiv 2021 | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | official | MADGRAD, MirrorMADGRAD |
| MaxVA | ECML PKDD 2021 | MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients | official | — |
| Nero | ICML 2021 | Learning by Turning: Neural Architecture Aware Optimisation | official | — |
| PNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | — |
| AdaPNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | AdaPNM |
| Ranger21 | arXiv 2021 | Ranger21: a synergistic deep learning optimizer | official | Ranger21 |
| SGDP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | SGDP |
| AdaFamily | arXiv 2022 | AdaFamily: A family of Adam-like adaptive gradient methods | — | — |
| Adai | ICML 2022 | Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum | official | Adai |
| AdamMC | CVMI 2022 | Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks | — | — |
| Adan | arXiv 2022 | Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | official | Adan |
| AdaSmooth | arXiv 2022 | AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio | — | AdaSmooth |
| AEGDM | Annals of Applied Mathematics 2022 | An Adaptive Gradient Method with Energy and Momentum | official | — |
| Amos | arXiv 2022 | Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | official | Amos |
| GDA-AM | ICLR 2022 | GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration | official | — |
| KOALA | AAAI 2022 | KOALA: A Kalman Optimization Algorithm with Loss Adaptivity | official | — |
| RotoGrad | ICLR 2022 | RotoGrad: Gradient Homogenization in Multitask Learning | official | — |
| SRSGD | SIAM Journal on Imaging Sciences 2022 | Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | — | — |
| Step-Tuned SGD | Neural Processing Letters 2022 | Second-order step-size tuning of SGD for non-convex optimization | — | — |
| AdaInject | IEEE TAI 2023 | AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks | official | — |
| AdaNorm | WACV 2023 | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | official | AdaNorm |
| AGD | NeurIPS 2023 | AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix | — | — |
| Aida | TMLR 2023 | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | official | — |
| Lion | NeurIPS 2023 | Symbolic Discovery of Optimization Algorithms | official | Lion |
| Lookaround | NeurIPS 2023 | Lookaround Optimizer: k steps around, 1 step average | — | — |
| MultiAdam | ICML 2023 | MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks | — | — |
| RLEKF | AAAI 2023 | RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy | — | — |
| Scheduled Weight Decay (SWD) | NeurIPS 2023 | On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective | official | — |
| SGDF | arXiv 2023 | Signal Processing Meets SGD: From Momentum to Filter | — | — |
| StableAdamW | NeurIPS 2023 | Stable and low-precision training for large-scale vision-language models | community | StableAdamW |
| AdaAct | ICDMW 2024 | An Adaptive Method Stabilizing Activations for Enhanced Generalization | — | — |
| Adam-atan2 | ICML 2024 | Scaling Exponents Across Parameterizations and Optimizers | community | AdamAtan2 |
| Adam-Rel | NeurIPS 2024 | Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps | — | — |
| AdEMAMix | arXiv 2024 | The AdEMAMix Optimizer: Better, Faster, Older | official | AdEMAMix |
| ADOPT | NeurIPS 2024 | ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate | official | ADOPT |
| AGS-GD | arXiv 2024 | Anisotropic Gaussian Smoothing for Gradient-based Optimization | — | — |
| BADM | arXiv 2024 | BADM: Batch ADMM for Deep Learning | — | — |
| CaAdam | arXiv 2024 | CaAdam: Improving Adam optimizer using connection aware methods | official | — |
| CAdam | arXiv 2024 | CAdam: Confidence-Based Optimization for Online Learning | — | — |
| Cautious Optimizers | arXiv 2024 | Cautious Optimizers: Improving Training with One Line of Code | official | — |
| EXAdam | arXiv 2024 | EXAdam: The Power of Adaptive Cross-Moments | official | EXAdam |
| FAdam | arXiv 2024 | FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | community | FAdam |
| GrokAdamW | — | AdamW variant with Grokfast-style gradient amplification | official | GrokAdamW |
| Grokfast | arXiv 2024 | Grokfast: Accelerated Grokking by Amplifying Slow Gradients | official | — |
| INNAprop | arXiv 2024 | A second-order-like optimizer with adaptive gradient scaling for deep learning | official | — |
| KATE | NeurIPS 2024 | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | official | — |
| MADA | ICML 2024 | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent | — | — |
| RSGDM | CCSB 2024 | Reducing Bias in Deep Learning Optimization: The RSGDM Approach | — | — |
| SET-Adam | ECML PKDD 2024 | On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance | — | — |
| SNGM | Science China Information Sciences 2024 | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training | — | — |
| SRMM | JMLR 2024 | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | official | — |
| TAM | arXiv 2024 | Torque-Aware Momentum | — | — |
| WarpAdam | arXiv 2024 | WarpAdam: A new Adam optimizer based on Meta-Learning approach | — | — |
| AbsSADMM | arXiv 2025 | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization | — | — |
| AdamC | arXiv 2025 | Why Gradients Rapidly Increase Near the End of Training | — | — |
| AdamNX | arXiv 2025 | AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate | official | — |
| AdamS | EMNLP 2025 | AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training | — | — |
| adaNAPG | arXiv 2025 | Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization | — | — |
| Ano | arXiv 2025 | ANO : Faster is Better in Noisy Landscape | official | — |
| BCOS | arXiv 2025 | Stochastic Approximation with Block Coordinate Optimal Stepsizes | official | — |
| Cautious Weight Decay | arXiv 2025 | Cautious Weight Decay | community | — |
| Conda | arXiv 2025 | Conda: Column-Normalized Adam for Training Large Language Models Faster | official | — |
| Coupled Adam | ACL 2025 | Better Embeddings with Coupled Adam | — | — |
| DecGD | Machine Learning 2025 | A New Adaptive Gradient Method with Gradient Decomposition | — | — |
| DEO | arXiv 2025 | Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training | official | — |
| EmoNavi | — | An emotion-driven optimizer that feels loss and navigates accordingly | official | — |
| MARS | ICML 2025 | MARS: Unleashing the Power of Variance Reduction for Training Large Models | official | MARS |
| FOCUS | arXiv 2025 | FOCUS: First Order Concentrated Updating Scheme | official | FOCUS |
| FSGDM | ICLR 2025 | On the Performance Analysis of Momentum Method: A Frequency Domain Perspective | — | — |
| Grams | ICLR Workshop 2025 | Grams: Gradient Descent with Adaptive Momentum Scaling | official | Grams |
| HGM | arXiv 2025 | Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate | — | — |
| HVAdam | AAAI 2025 | HVAdam: A Full-Dimension Adaptive Optimizer | — | — |
| KO | arXiv 2025 | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches | — | — |
| KOALA++ | NeurIPS 2025 | KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products | — | — |
| Kourkoutas-Beta | arXiv 2025 | Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair | official | KourkoutasSoftmaxFlex |
| MIAdam | AAAI 2025 | A Method for Enhancing Generalization of Adam by Multiple Integrations | official | — |
| μ²-SGD | ICLR 2025 | Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism | — | — |
| ⊥Grad (OrthoGrad) | ICLR 2025 | Grokking at the Edge of Numerical Stability | official | — |
| Overshoot | arXiv 2025 | Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization | official | — |
| PadamP | arXiv 2025 | Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning | — | — |
| Simplified-AdEMAMix | arXiv 2025 | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | official | — |
| LyAm | arXiv 2025 | LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments | — | — |
| NIRMAL | arXiv 2025 | Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum | — | — |
| SCSAdamW | arXiv 2025 | Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW | official | — |
| SKA-SGD | arXiv 2025 | Streaming Krylov-Accelerated Stochastic Gradient Descent | — | — |
| SoftSignSGD (S3) | arXiv 2025 | SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam | — | — |
| SPAM | arXiv 2025 | SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training | official | — |
| VSGD | TMLR 2025 | Variational Stochastic Gradient Descent for Deep Neural Networks | official | — |
| ZetA | arXiv 2025 | ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning | — | — |
| AdaGC | ICML 2026 | AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping | — | AdaGC |
| Anon | arXiv 2026 | Anon: Extrapolating Adaptivity Beyond SGD and Adam | — | — |
| C-Adam | arXiv 2026 | A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm | — | — |
| DualAdam | arXiv 2026 | Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers | official | — |
| FANoS | arXiv 2026 | FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization | official | — |
| GradPower | ICML 2026 | GradPower: Powering Gradients for Faster Language Model Pre-Training | — | — |
| HomeAdam | arXiv 2026 | HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization | — | — |
| NOVAK | arXiv 2026 | NOVAK: Unified adaptive optimizer for deep neural networks | — | — |
| PS-Clip-SGD | arXiv 2026 | Robust and Fast Training via Per-Sample Clipping | — | — |
| SparseOpt | ICML 2026 | SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training | — | — |
| Stable-SPAM / GradientStabilizer | ICML 2026 | GradientStabilizer: Fix the Norm, Not the Gradient | official | — |
| VRAdam | ICLR 2026 | A Physics-Inspired Optimizer: Velocity Regularized Adam | official | — |
| SparseAdam | — | Adam variant for sparse gradients | official | SparseAdam |
| OptMuon | arXiv 2026 | OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality | — | — |
| FOGO | arXiv 2026 | FOGO: Forgetting-aware Orthogonalization Optimizer | — | — |
| AdamO | ICML 2026 | Preserving Plasticity in Continual Learning via Dynamical Isometry | — | — |
| MAdam | arXiv 2026 | MAdam: Metric-Aware Multi-Objective Adam | — | — |
| MuCon | arXiv 2026 | MuCon: Clipped Muon Updates for LLM Training | — | — |
| NuMuon | arXiv 2026 | NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training | — | — |
| MiMuon | arXiv 2026 | MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models | — | — |
| Pion | arXiv 2026 | Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation | official | — |
| iMuon (Intrinsic Muon) | arXiv 2026 | Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds | official | — |
| Muon-OGD | arXiv 2026 | Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning | — | — |
| Newton-Muon | arXiv 2026 | The Newton-Muon Optimizer | official | — |
| MuonEq | arXiv 2026 | MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration | official | — |
| RMNP | arXiv 2026 | RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization | official | — |
| MUD | arXiv 2026 | Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training | — | — |
| NAMO | arXiv 2026 | Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum | official | — |
| SpecMuon | arXiv 2026 | Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning | — | — |
| ARO | arXiv 2026 | ARO: A New Lens On Matrix Optimization For Large Models | — | — |
| PRISM | arXiv 2026 | PRISM: Structured Optimization via Anisotropic Spectral Shaping | — | — |
| MCSD / SPEL | arXiv 2026 | Manifold constrained steepest descent | — | — |
| Variance-Adaptive Muon (Muon-NSR / Muon-VS) | arXiv 2026 | Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum | — | — |
| MuonAll | arXiv 2025 | MuonAll: Muon Variant for Efficient Finetuning of Large Language Models | official | — |
| Gluon | arXiv 2025 (also accepted at ICML 2025 HiLD workshop) | Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) | — | — |
| LPSGD / LPSGDM | arXiv 2026 | Beyond L2-norm and L-infinity-norm: A Curvature-Inspired ℓₚ-Norm Scheme for Deep Neural Networks | — | — |
| ABSignSGD | ICLR 2026 | Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning | — | — |
| StoSignSGD | arXiv 2026 | StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models | — | — |
| Hybrid SignSGD-SGD switching | arXiv 2026 | Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy | — | — |
| SoftSignum / SoftMuon | ICML 2026 | Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling | official | — |
| Accelerated SignGD | arXiv 2025 | Norm-Constrained Flows and Sign-Based Optimization: Theory and Algorithms | — | — |
| CLion | arXiv 2026 | CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization | — | — |
| OLion | arXiv 2026 | OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ℓ∞ Implicit Biases | official | — |
| MGUP | NeurIPS 2025 | MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization | official | — |
| Magma | arXiv 2026 | On Surprising Effectiveness of Masking Updates in Adaptive Optimizers | — | — |
| AGGC | ACL 2026 | AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training | official | — |
| Clipped Scion | NeurIPS 2025 | Generalized Gradient Norm Clipping & Non-Euclidean (L₀, L₁)-Smoothness | official | — |
| SPECTRA | ICML 2026 | Enhancing LLM Training via Spectral Clipping | official | — |
| Spectral Clipping (matrix-valued) | arXiv 2026 | Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters | — | — |
| SPAMP | ACM Multimedia Asia 2025 (7th ACM International Conference on Multimedia in Asia) | Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control | — | — |
| NucGD | arXiv 2026 | Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints | official | — |
| Batched / Transported Scion | arXiv 2026 | Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise | — | — |
| EMA bias-corrected iterate averaging | NeurIPS 2025 Workshop (OPT 2025) | EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes | — | — |
| RGrad-Avg | NeurIPS 2025 Workshop (OPT 2025) | On Riemannian Gradient Descent Algorithm using gradient averaging | — | — |
| SGD with adaptive preconditioning | ICLR 2026 | SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration | — | — |
| HTMuon | arXiv 2026 | HTMuon: Improving Muon via Heavy-Tailed Spectral Correction | official | — |
| MARS-M | arXiv 2025 | MARS-M: When Variance Reduction Meets Matrices | official | — |
| Drop-Muon | arXiv 2025 | Drop-Muon: Update Less, Converge Faster | — | — |
| Muon+ | arXiv 2026 | MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training | official | — |
| TrasMuon | ICLR 2026 Workshop Sci4DL | TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers | — | — |
| Adam-SHANG | arXiv 2026 | Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization | — | — |
| EMA-Nesterov | arXiv 2026 | EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization | — | — |
| S-Adam | arXiv 2026 | Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization | — | — |
| IAdaPID-ADG | arXiv 2026 | An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning | — | — |
| CT-AGD | arXiv 2026 | Accelerated Gradient Descent for Faster Convergence with Minimal Overhead | — | — |
| GPA (Generalized Primal Averaging) | arXiv 2025 | Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs | official | — |
| SNOO | arXiv 2025 | SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients | official | — |
| Riemannion | ICLR 2026 | LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters | — | — |
| Optimal Projection-Free Adaptive SGD | arXiv 2026 | Optimal Projection-Free Adaptive SGD for Matrix Optimization | — | — |
| AdamCB | ICLR 2025 | ADAM Optimization with Adaptive Batch Selection | — | — |
| Kalman-Adam | Knowledge-Based Systems 2026 | Kalman-Adam: Optimal Bayesian moment estimation for memory-efficient and generalizable deep learning | — | — |
| AdamHD (AdamHuberDecay) | NeurIPS 2025 Workshop (ScaleOpt: GPU-Accelerated and Scalable Optimization) | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training | — | — |
| MVN-Grad | arXiv 2026 | Adaptive Optimization via Momentum on Variance-Normalized Gradients | — | — |
| Compositional Muon (CM) | Tilde Research blog 2026 | Towards Compositional Steepest Descent | official | — |