First-Order Optimizers

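First-order optimizers update parameters using only gradients and statistics accumulated from them, such as momentum and second-moment estimates. This page covers the stochastic gradient descent lineage, the Adam family, and more recent sign-based, variance-reduced, and orthogonalized-momentum (Muon-style) methods. In the table below, the Code column marks a linked implementation as official or community, and the zij column gives the class name for optimizers already implemented in the package. An empty cell means there is no entry for that optimizer.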
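
As a rough illustration of the definition above, the sketch below implements two textbook update rules in plain Python: SGD with heavy-ball momentum (a first-moment statistic) and Adam (first- and second-moment statistics with bias correction). The function names, the `state` dictionary layout, the toy quadratic objective, and the hyperparameter values are all assumptions made for this example. They are not the zij API, and the classes listed in the table may differ in detail.

```python
import math

def sgd_momentum_step(params, grads, state, lr=0.01, momentum=0.9):
    """One SGD step with heavy-ball momentum (hypothetical helper)."""
    velocity = state.setdefault("velocity", [0.0] * len(params))
    for i, g in enumerate(grads):
        # Accumulated gradient statistic: a decaying sum of past gradients.
        velocity[i] = momentum * velocity[i] + g
        params[i] -= lr * velocity[i]
    return params

def adam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected moment estimates (hypothetical helper)."""
    m = state.setdefault("m", [0.0] * len(params))  # first-moment estimate
    v = state.setdefault("v", [0.0] * len(params))  # second-moment estimate
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # correct the bias toward zero
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy objective (an assumption for this sketch): f(x) = 0.5 * sum(c_i * x_i^2).
CURVATURES = (1.0, 10.0)

def quadratic_loss(params):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURES, params))

def quadratic_grad(params):
    return [c * x for c, x in zip(CURVATURES, params)]

def run(step_fn, steps=200, **hyperparams):
    """Apply `step_fn` repeatedly from a fixed start and return the final loss."""
    params, state = [1.0, 1.0], {}
    for _ in range(steps):
        step_fn(params, quadratic_grad(params), state, **hyperparams)
    return quadratic_loss(params)

if __name__ == "__main__":
    print("SGD + momentum:", run(sgd_momentum_step, lr=0.01, momentum=0.9))
    print("Adam:          ", run(adam_step, lr=0.1))
```

Both helpers only ever read the gradient and their own running statistics. No Hessian or other curvature information is computed, which is what places them in the first-order category.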

Optimizer Venue Paper Code zij
ASGD SIAM Journal on Control and Optimization 1992 Acceleration of Stochastic Approximation by Averaging community ASGD
Rprop ICNN 1993 A direct adaptive method for faster backpropagation learning: the RPROP algorithm community Rprop
Adagrad JMLR 2011 Adaptive Subgradient Methods for Online Learning and Stochastic Optimization community Adagrad
Adadelta arXiv 2012 ADADELTA: An Adaptive Learning Rate Method community Adadelta
RMSprop Lecture notes 2012 Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude community RMSprop
FTRL KDD 2013 Ad Click Prediction: a View from the Trenches
SGD ICML 2013 On the importance of initialization and momentum in deep learning community SGD
Adam ICLR 2015 Adam: A Method for Stochastic Optimization community Adam
AdaMax ICLR 2015 Adam: A Method for Stochastic Optimization community Adamax
Nadam ICLR Workshop 2016 Incorporating Nesterov Momentum into Adam community NAdam
LARS arXiv 2017 Large Batch Training of Convolutional Networks community LARS
SWATS arXiv 2017 Improving Generalization Performance by Switching from Adam to SGD community SWATS
A2Grad arXiv 2018 Optimal Adaptive and Accelerated Stochastic Gradient Descent community A2GradUni, A2GradInc, A2GradExp
AccSGD ICLR 2018 On the insufficiency of existing momentum schemes for Stochastic Optimization official AccSGD
AMSGrad ICLR 2018 On the Convergence of Adam and Beyond community
GADAM arXiv 2018 GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization
M-SVAG ICML 2018 Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients official
PID CVPR 2018 A PID Controller Approach for Stochastic Optimization of Deep Networks official PID
VR-SGD IEEE TKDE 2018 VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning
Yogi NeurIPS 2018 Adaptive Methods for Nonconvex Optimization community Yogi
AdaBound ICLR 2019 Adaptive Gradient Methods with Dynamic Bound of Learning Rate official AdaBound, AdaBoundW
AdaMod arXiv 2019 An Adaptive and Momental Bound Method for Stochastic Learning official AdaMod
AdamW ICLR 2019 Decoupled Weight Decay Regularization official AdamW
AdaShift ICLR 2019 AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods community AdaShift
AggMo ICLR 2019 Aggregated Momentum: Stability Through Passive Damping official AggMo
AvaGrad arXiv 2019 Domain-independent Dominance of Adaptive Methods official AvaGrad
HAdam NeurIPS Workshop 2019 On Higher-order Moments in Adam
HyperAdam AAAI 2019 HyperAdam: A Learnable Task-Adaptive Adam for Network Training
Lookahead NeurIPS 2019 Lookahead Optimizer: k steps forward, 1 step back community Lookahead
NosAdam IJCAI 2019 Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate
NovoGrad arXiv 2019 Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks community NovoGrad
QHAdam / QHM ICLR 2019 Quasi-hyperbolic momentum and Adam for deep learning official QHAdam, QHM
Ranger RAdam and Lookahead combination official Ranger
Sadam arXiv 2019 Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM
AdaBelief NeurIPS 2020 AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients official AdaBelief
Adam+ arXiv 2020 Adam+: A Stochastic Method with Adaptive Variance Reduction
AdamBS NeurIPS 2020 Adam with Bandit Sampling for Deep Learning
AdaSGD arXiv 2020 AdaSGD: Bridging the gap between SGD and Adam
Cayley SGD ICLR 2020 Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform official
clipped-SGD NeurIPS 2020 Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping official
DEAM ASONAM 2020 DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization
diffGrad IEEE TNNLS 2020 diffGrad: An Optimization Method for Convolutional Neural Networks official DiffGrad
EAdam arXiv 2020 EAdam Optimizer: How ε Impact Adam official
Fromage NeurIPS 2020 On the distance between two neural networks and the stability of learning official
Gradient Centralization (GC) ECCV 2020 Gradient Centralization: A New Optimization Technique for Deep Neural Networks official
LAMB ICLR 2020 Large Batch Optimization for Deep Learning: Training BERT in 76 minutes community Lamb
LaProp arXiv 2020 LaProp: Separating Momentum and Adaptivity in Adam official LaProp
NIGT ICML 2020 Momentum Improves Normalized SGD official
Padam IJCAI 2020 Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks official PAdam
signSGD ICML 2018 signSGD: Compressed Optimisation for Non-Convex Problems community SignSGD
pbSGD IJCAI 2020 pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization official
PCGrad NeurIPS 2020 Gradient Surgery for Multi-Task Learning official
RAdam ICLR 2020 On the Variance of the Adaptive Learning Rate and Beyond official RAdam
SGD-G2 ICPR 2020 Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent
ACMo AAAI 2021 ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization
ACProp NeurIPS 2021 Momentum Centering and Asynchronous Update for Adaptive Gradient Methods official
AdaL arXiv 2021 AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations
AdamD arXiv 2021 AdamD: Improved bias-correction in Adam
AdamP ICLR 2021 AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights official AdamP
Adaptive Gradient Clipping (AGC) ICML 2021 High-Performance Large-Scale Image Recognition Without Normalization official
AngularGrad arXiv 2021 AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks official
BGADAM IJCNN 2021 BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization
Gravity arXiv 2021 Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning official Gravity
MADGRAD arXiv 2021 Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization official MADGRAD, MirrorMADGRAD
MaxVA ECML PKDD 2021 MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients official
Nero ICML 2021 Learning by Turning: Neural Architecture Aware Optimisation official
PNM ICML 2021 Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization official
AdaPNM ICML 2021 Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization official AdaPNM
Ranger21 arXiv 2021 Ranger21: a synergistic deep learning optimizer official Ranger21
SGDP ICLR 2021 AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights official SGDP
AdaFamily arXiv 2022 AdaFamily: A family of Adam-like adaptive gradient methods
Adai ICML 2022 Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum official Adai
AdamMC CVMI 2022 Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks
Adan arXiv 2022 Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models official Adan
AdaSmooth arXiv 2022 AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio AdaSmooth
AEGDM Annals of Applied Mathematics 2022 An Adaptive Gradient Method with Energy and Momentum official
Amos arXiv 2022 Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale official Amos
GDA-AM ICLR 2022 GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration official
KOALA AAAI 2022 KOALA: A Kalman Optimization Algorithm with Loss Adaptivity official
RotoGrad ICLR 2022 RotoGrad: Gradient Homogenization in Multitask Learning official
SRSGD SIAM Journal on Imaging Sciences 2022 Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Step-Tuned SGD Neural Processing Letters 2022 Second-order step-size tuning of SGD for non-convex optimization
AdaInject IEEE TAI 2023 AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks official
AdaNorm WACV 2023 AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs official AdaNorm
AGD NeurIPS 2023 AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix
Aida TMLR 2023 A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range official
Lion NeurIPS 2023 Symbolic Discovery of Optimization Algorithms official Lion
Lookaround NeurIPS 2023 Lookaround Optimizer: k steps around, 1 step average
MultiAdam ICML 2023 MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks
RLEKF AAAI 2023 RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy
Scheduled Weight Decay (SWD) NeurIPS 2023 On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective official
SGDF arXiv 2023 Signal Processing Meets SGD: From Momentum to Filter
StableAdamW NeurIPS 2023 Stable and low-precision training for large-scale vision-language models community StableAdamW
AdaAct ICDMW 2024 An Adaptive Method Stabilizing Activations for Enhanced Generalization
Adam-atan2 ICML 2024 Scaling Exponents Across Parameterizations and Optimizers community AdamAtan2
Adam-Rel NeurIPS 2024 Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps
AdEMAMix arXiv 2024 The AdEMAMix Optimizer: Better, Faster, Older official AdEMAMix
ADOPT NeurIPS 2024 ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate official ADOPT
AGS-GD arXiv 2024 Anisotropic Gaussian Smoothing for Gradient-based Optimization
BADM arXiv 2024 BADM: Batch ADMM for Deep Learning
CaAdam arXiv 2024 CaAdam: Improving Adam optimizer using connection aware methods official
CAdam arXiv 2024 CAdam: Confidence-Based Optimization for Online Learning
Cautious Optimizers arXiv 2024 Cautious Optimizers: Improving Training with One Line of Code official
EXAdam arXiv 2024 EXAdam: The Power of Adaptive Cross-Moments official EXAdam
FAdam arXiv 2024 FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information community FAdam
GrokAdamW AdamW variant with Grokfast-style gradient amplification official GrokAdamW
Grokfast arXiv 2024 Grokfast: Accelerated Grokking by Amplifying Slow Gradients official
INNAprop arXiv 2024 A second-order-like optimizer with adaptive gradient scaling for deep learning official
KATE NeurIPS 2024 Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad official
MADA ICML 2024 MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
RSGDM CCSB 2024 Reducing Bias in Deep Learning Optimization: The RSGDM Approach
SET-Adam ECML PKDD 2024 On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance
SNGM Science China Information Sciences 2024 Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training
SRMM JMLR 2024 Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates official
TAM arXiv 2024 Torque-Aware Momentum
WarpAdam arXiv 2024 WarpAdam: A new Adam optimizer based on Meta-Learning approach
AbsSADMM arXiv 2025 Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization
AdamC arXiv 2025 Why Gradients Rapidly Increase Near the End of Training
AdamNX arXiv 2025 AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate official
AdamS EMNLP 2025 AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
adaNAPG arXiv 2025 Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization
Ano arXiv 2025 ANO: Faster is Better in Noisy Landscape official
BCOS arXiv 2025 Stochastic Approximation with Block Coordinate Optimal Stepsizes official
Cautious Weight Decay arXiv 2025 Cautious Weight Decay community
Conda arXiv 2025 Conda: Column-Normalized Adam for Training Large Language Models Faster official
Coupled Adam ACL 2025 Better Embeddings with Coupled Adam
DecGD Machine Learning 2025 A New Adaptive Gradient Method with Gradient Decomposition
DEO arXiv 2025 Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training official
EmoNavi An emotion-driven optimizer that feels loss and navigates accordingly official
MARS ICML 2025 MARS: Unleashing the Power of Variance Reduction for Training Large Models official MARS
FOCUS arXiv 2025 FOCUS: First Order Concentrated Updating Scheme official FOCUS
FSGDM ICLR 2025 On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Grams ICLR Workshop 2025 Grams: Gradient Descent with Adaptive Momentum Scaling official Grams
HGM arXiv 2025 Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate
HVAdam AAAI 2025 HVAdam: A Full-Dimension Adaptive Optimizer
KO arXiv 2025 KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches
KOALA++ NeurIPS 2025 KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products
Kourkoutas-Beta arXiv 2025 Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair official KourkoutasSoftmaxFlex
MIAdam AAAI 2025 A Method for Enhancing Generalization of Adam by Multiple Integrations official
μ²-SGD ICLR 2025 Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism
⊥Grad (OrthoGrad) ICLR 2025 Grokking at the Edge of Numerical Stability official
Overshoot arXiv 2025 Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization official
PadamP arXiv 2025 Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning
Simplified-AdEMAMix arXiv 2025 Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants official
LyAm arXiv 2025 LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments
NIRMAL arXiv 2025 Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum
SCSAdamW arXiv 2025 Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW official
SKA-SGD arXiv 2025 Streaming Krylov-Accelerated Stochastic Gradient Descent
SoftSignSGD (S3) arXiv 2025 SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam
SPAM arXiv 2025 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training official
VSGD TMLR 2025 Variational Stochastic Gradient Descent for Deep Neural Networks official
ZetA arXiv 2025 ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning
AdaGC ICML 2026 AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping AdaGC
Anon arXiv 2026 Anon: Extrapolating Adaptivity Beyond SGD and Adam
C-Adam arXiv 2026 A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm
DualAdam arXiv 2026 Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers official
FANoS arXiv 2026 FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization official
GradPower ICML 2026 GradPower: Powering Gradients for Faster Language Model Pre-Training
HomeAdam arXiv 2026 HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization
NOVAK arXiv 2026 NOVAK: Unified adaptive optimizer for deep neural networks
PS-Clip-SGD arXiv 2026 Robust and Fast Training via Per-Sample Clipping
SparseOpt ICML 2026 SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training
Stable-SPAM / GradientStabilizer ICML 2026 GradientStabilizer: Fix the Norm, Not the Gradient official
VRAdam ICLR 2026 A Physics-Inspired Optimizer: Velocity Regularized Adam official
SparseAdam Adam variant for sparse gradients official SparseAdam
OptMuon arXiv 2026 OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
FOGO arXiv 2026 FOGO: Forgetting-aware Orthogonalization Optimizer
AdamO ICML 2026 Preserving Plasticity in Continual Learning via Dynamical Isometry
MAdam arXiv 2026 MAdam: Metric-Aware Multi-Objective Adam
MuCon arXiv 2026 MuCon: Clipped Muon Updates for LLM Training
NuMuon arXiv 2026 NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training
MiMuon arXiv 2026 MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Pion arXiv 2026 Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation official
iMuon (Intrinsic Muon) arXiv 2026 Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds official
Muon-OGD arXiv 2026 Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Newton-Muon arXiv 2026 The Newton-Muon Optimizer official
MuonEq arXiv 2026 MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration official
RMNP arXiv 2026 RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization official
MUD arXiv 2026 Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training
NAMO arXiv 2026 Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum official
SpecMuon arXiv 2026 Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
ARO arXiv 2026 ARO: A New Lens On Matrix Optimization For Large Models
PRISM arXiv 2026 PRISM: Structured Optimization via Anisotropic Spectral Shaping
MCSD / SPEL arXiv 2026 Manifold constrained steepest descent
Variance-Adaptive Muon (Muon-NSR / Muon-VS) arXiv 2026 Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum
MuonAll arXiv 2025 MuonAll: Muon Variant for Efficient Finetuning of Large Language Models official
Gluon arXiv 2025 (also accepted at ICML 2025 HiLD workshop) Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
LPSGD / LPSGDM arXiv 2026 Beyond L2-norm and L-infinity-norm: A Curvature-Inspired ℓ_p-Norm Scheme for Deep Neural Networks
ABSignSGD ICLR 2026 Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning
StoSignSGD arXiv 2026 StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
Hybrid SignSGD-SGD switching arXiv 2026 Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy
SoftSignum / SoftMuon ICML 2026 Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling official
Accelerated SignGD arXiv 2025 Norm-Constrained Flows and Sign-Based Optimization: Theory and Algorithms
CLion arXiv 2026 CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
OLion arXiv 2026 OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ℓ_∞ Implicit Biases official
MGUP NeurIPS 2025 MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization official
Magma arXiv 2026 On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
AGGC ACL 2026 AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training official
Clipped Scion NeurIPS 2025 Generalized Gradient Norm Clipping & Non-Euclidean (L_0,L_1)-Smoothness official
SPECTRA ICML 2026 Enhancing LLM Training via Spectral Clipping official
Spectral Clipping (matrix-valued) arXiv 2026 Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
SPAMP ACM Multimedia Asia 2025 (7th ACM International Conference on Multimedia in Asia) Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control
NucGD arXiv 2026 Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints official
Batched / Transported Scion arXiv 2026 Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
EMA bias-corrected iterate averaging NeurIPS 2025 Workshop (OPT 2025) EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
RGrad-Avg OPT 2025 (17th Annual Workshop on Optimization for Machine Learning, co-located with NeurIPS 2025) On Riemannian Gradient Descent Algorithm using gradient averaging
SGD with adaptive preconditioning ICLR 2026 SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration
HTMuon arXiv 2026 HTMuon: Improving Muon via Heavy-Tailed Spectral Correction official
MARS-M arXiv 2025 MARS-M: When Variance Reduction Meets Matrices official
Drop-Muon arXiv 2025 Drop-Muon: Update Less, Converge Faster
Muon+ arXiv 2026 MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training official
TrasMuon ICLR 2026 Workshop Sci4DL TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
Adam-SHANG arXiv 2026 Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization
EMA-Nesterov arXiv 2026 EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
S-Adam arXiv 2026 Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
IAdaPID-ADG arXiv 2026 An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning
CT-AGD arXiv 2026 Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
GPA (Generalized Primal Averaging) arXiv 2025 Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs official
SNOO arXiv 2025 SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients official
Riemannion ICLR 2026 LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters
Optimal Projection-Free Adaptive SGD arXiv 2026 Optimal Projection-Free Adaptive SGD for Matrix Optimization
AdamCB ICLR 2025 ADAM Optimization with Adaptive Batch Selection
Kalman-Adam Knowledge-Based Systems 2026 Kalman-Adam: Optimal bayesian moment estimation for memory-Efficient and generalizable deep learning
AdamHD (AdamHuberDecay) NeurIPS 2025 Workshop (ScaleOpt: GPU-Accelerated and Scalable Optimization) AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
MVN-Grad arXiv 2026 Adaptive Optimization via Momentum on Variance-Normalized Gradients
Compositional Muon (CM) Tilde Research blog 2026 Towards Compositional Steepest Descent official