Memory-Efficient Optimizers

Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
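
As a rough illustration of the budget described above, the sketch below estimates optimizer-state memory for a single weight matrix under a few of these families. The helper name `optimizer_state_bytes`, the scheme labels, and the default rank are hypothetical and chosen only for this example. The accounting is simplified: it assumes 4-byte full-precision states, ignores quantization metadata and padding, assumes the factored scheme keeps no first-moment buffer, and counts the low-rank scheme as one projection matrix plus two projected moments.

```python
def optimizer_state_bytes(rows, cols, scheme, rank=128):
    """Approximate optimizer-state bytes for one rows x cols weight matrix.

    Hypothetical helper for illustration only. It ignores quantization
    metadata, padding, and any master-weight copies.
    """
    n = rows * cols
    if scheme == "adam_fp32":
        # First and second moment, 4 bytes each, for every parameter.
        return 2 * 4 * n
    if scheme == "adam_8bit":
        # Both moments stored in 1 byte per value.
        return 2 * 1 * n
    if scheme == "factored":
        # Row and column statistics stand in for the full second moment;
        # assumes no first-moment buffer is kept.
        return 4 * (rows + cols)
    if scheme == "low_rank":
        # One rows x rank projection plus two rank x cols moments, all fp32.
        return 4 * (rows * rank + 2 * rank * cols)
    if scheme == "stateless":
        # No per-parameter state is carried between steps.
        return 0
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    schemes = ("adam_fp32", "adam_8bit", "low_rank", "factored", "stateless")
    for scheme in schemes:
        mib = optimizer_state_bytes(4096, 4096, scheme) / 2**20
        print(f"{scheme:>10}: {mib:10.3f} MiB")
```

Real implementations differ in detail (for example, quantized optimizers also store per-block scaling constants), so these figures are order-of-magnitude guides rather than measurements.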

Optimizer Venue Paper Code
Adafactor ICML 2018 Adafactor: Adaptive Learning Rates with Sublinear Memory Cost official Adafactor
SM3 NeurIPS 2019 Memory-Efficient Adaptive Optimization official SM3
8-bit Optimizers ICLR 2022 8-bit Optimizers via Block-wise Quantization official
tpSGD arXiv 2022 Learning with Local Gradients at the Edge
4-bit Optimizers NeurIPS 2023 Memory Efficient Optimizers with 4-bit States official
Adalite GitHub 2023 Adalite: a custom optimizer based on Adafactor and LAMB official
AdaLomo ACL 2024 Findings AdaLomo: Low-memory Optimization with Adaptive Learning Rate official AdaLomo
CAME ACL 2023 CAME: Confidence-guided Adaptive Memory Efficient Optimization official CAME
Lion NeurIPS 2023 Symbolic Discovery of Optimization Algorithms official
LOMO ACL 2024 Full Parameter Fine-tuning for Large Language Models with Limited Resources official Lomo
MeZO NeurIPS 2023 Fine-Tuning Language Models with Just Forward Passes official
Tiger GitHub 2023 Tiger: A Tight-fisted Optimizer official Tiger
4-bit Shampoo NeurIPS 2024 4-bit Shampoo for Memory-Efficient Network Training official
Adam-mini ICLR 2025 Adam-mini: Use Fewer Learning Rates To Gain More official AdamMini
Adapprox arXiv 2024 Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
AdaRankGrad ICLR 2025 AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning
Addax ICLR 2025 Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models official
APOLLO MLSys 2025 APOLLO: SGD-like Memory, AdamW-level Performance official APOLLO
BAdam NeurIPS 2024 BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models official BlockOptimizer
COAP CVPR 2025 COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection official
Fira NeurIPS 2025 Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? official FiraAdamW
Flora ICML 2024 Flora: Low-Rank Adapters Are Secretly Gradient Compressors official
FRUGAL ICML 2025 FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training official
GaLore ICML 2024 GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection official GaLoreAdamW
GoLore ICML 2025 Subspace Optimization for Large Language Models with Convergence Guarantees official
GRASS EMNLP 2024 Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients official
LDAdam ICLR 2025 LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics official LDAdamW
LoQT NeurIPS 2024 LoQT: Low-Rank Adapters for Quantized Pretraining official
LoRA-RITE ICLR 2025 LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization official
MicroAdam NeurIPS 2024 MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence official
Muon Blog 2024 Muon: An optimizer for hidden layers in neural networks official Muon
Online Subspace Descent NeurIPS 2024 Memory-Efficient LLM Training with Online Subspace Descent official
Q-GaLore CPAL 2025 Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients official
SGD-SaI arXiv 2024 No More Adam: Learning Rate Scaling at Initialization is All You Need official SGDSaI
SMMF AAAI 2025 SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization official
SNSM ICML 2025 Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees official
SWAN ICML 2025 SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
AlphaGrad arXiv 2025 AlphaGrad: Non-Linear Gradient Normalization Optimizer
GWT arXiv 2025 GWT: Scalable Optimizer State Compression for Large Language Model Training
MLorc AISTATS 2026 MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation official
MoFaSGD TMLR 2025 Low-rank Momentum Factorization for Memory Efficient Training official
RACS / Alice arXiv 2025 Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension community
SinkGD arXiv 2025 Gradient Multi-Normalization for Stateless and Scalable LLM Training
SPAM ICLR 2025 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training official SPAM
SubTrack++ NeurIPS 2025 SubTrack++: Gradient Subspace Tracking for Scalable LLM Training official
SUMO NeurIPS 2025 SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
TensorGRaD arXiv 2025 TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training
FlashOptim arXiv 2026 FlashOptim: Optimizers for Memory-Efficient Training official
Rose GitHub 2026 Rose: Range-Of-Slice Equilibration optimizer official
SAGE ACL 2026 Findings SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
BlockLLM arXiv 2024 BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks official
Natural GaLore arXiv 2024 Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning official
SLTrain NeurIPS 2024 SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining official
8-bit Muon arXiv 2025 Effective Quantization of Muon Optimizer States
FFT-based Subspace Selection ICLR 2026 FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models official
FOAM arXiv 2025 FOAM: Blocked State Folding for Memory-Efficient LLM Training official
GaLore 2 arXiv 2025 GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
GradientStabilizer ICML 2026 GradientStabilizer: Fix the Norm, Not the Gradient official
GUM arXiv 2025 Unbiased Gradient Low-Rank Projection
I3S NeurIPS 2025 Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining
LORENZA TMLR 2026 LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM
ProjFactor (VLoRP) arXiv 2025 Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients
RSO arXiv 2025 A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
SCALE ICML 2026 Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
SlimAdam arXiv 2025 When Can You Get Away with Low Memory Adam? official
LoRA-Pre ICLR 2026 Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation official
Lotus arXiv 2026 Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching
POET-X ICML 2026 POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation official
MuonQ arXiv 2026 MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization official
4-bit-Muon-GRASP ICLR 2026 Achieving low-bit Muon through subspace preservation and grid quantization official
IO-Adam OpenReview 2026 IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation
H-Fac AISTATS 2025 Memory-Efficient Optimization with Factorized Hamiltonian Descent
LiMuon ICML 2026 LiMuon: Light and Fast Muon Optimizer for Large Models
M+Adam OPT 2025 (NeurIPS 2025 Workshop) M+Adam: Stable Low-Precision Training with Combined Adam–Madam Updates
SMET ICML 2026 Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling official
PowerStep arXiv 2026 PowerStep: Memory-Efficient Adaptive Optimization via ℓ_p-Norm Steepest Descent official
SRON OpenReview 2025 SRON: State-free LLM Training via Row-wise Gradient Normalization
GradLite arXiv 2025 Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Optimal Low-Rank SGE arXiv 2026 Optimal low-rank stochastic gradient estimation for LLM training
Spectral Compact Training (SCT) arXiv 2026 Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction official

Trainer integrations

Hugging Face transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below selects a memory-efficient optimizer. Adafactor ships with transformers itself; every other entry relies on a backing library that must be installed separately.

| optim value | Backing library |
| --- | --- |
| adafactor | transformers ships its own Adafactor implementation with relative-step and update-clipping options (Apache-2.0). |
| adamw_bnb_8bit / adamw_8bit | bitsandbytes AdamW with block-wise 8-bit quantized state (MIT). |
| paged_adamw_8bit / paged_adamw_32bit | bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT). |
| lion_8bit / lion_32bit / paged_lion_8bit / paged_lion_32bit | bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT). |
| ademamix_8bit / paged_ademamix_8bit / paged_ademamix_32bit | bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT). |
| rmsprop_bnb_8bit | bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT). |
| adamw_torch_4bit / adamw_torch_8bit | torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause). |
| galore_adamw / galore_adamw_8bit / galore_adafactor and *_layerwise variants | galore-torch, the official GaLore release (Apache-2.0). |
| apollo_adamw / apollo_adamw_layerwise | apollo-torch, the official APOLLO release (CC-BY-NC-4.0). |
| lomo / adalomo | lomo-optim, the official LOMO and AdaLomo release (MIT). |
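
Because most of the optim values above depend on a separately installed package, it can help to check for the backing library before building the training configuration. The following is a minimal sketch under stated assumptions. The `BACKENDS` table, `missing_backend`, and `make_training_args` are hypothetical names introduced for this example, the table covers only a subset of the values listed above, and the import names paired with each pip package are assumptions that should be confirmed against each library's documentation. `TrainingArguments` and its `optim` and `output_dir` parameters come from transformers.

```python
import importlib.util

# Partial, illustrative mapping: optim value -> (pip package, import name).
# None marks an optimizer that ships with transformers itself.
# The import names are assumptions made for this sketch.
BACKENDS = {
    "adafactor": None,
    "adamw_bnb_8bit": ("bitsandbytes", "bitsandbytes"),
    "paged_adamw_8bit": ("bitsandbytes", "bitsandbytes"),
    "lion_8bit": ("bitsandbytes", "bitsandbytes"),
    "adamw_torch_4bit": ("torchao", "torchao"),
    "galore_adamw": ("galore-torch", "galore_torch"),
    "apollo_adamw": ("apollo-torch", "apollo_torch"),
    "lomo": ("lomo-optim", "lomo_optim"),
    "adalomo": ("lomo-optim", "lomo_optim"),
}


def missing_backend(optim):
    """Return the pip package to install for `optim`, or None if nothing is missing.

    Values that are built in, or simply absent from this partial table,
    are reported as having no missing backend.
    """
    backend = BACKENDS.get(optim)
    if backend is None:
        return None
    pip_name, module_name = backend
    # find_spec locates the module without importing it.
    if importlib.util.find_spec(module_name) is None:
        return pip_name
    return None


def make_training_args(optim, output_dir="out"):
    """Build TrainingArguments for `optim`, failing early if its backend is absent."""
    pip_name = missing_backend(optim)
    if pip_name is not None:
        raise RuntimeError(f"optim={optim!r} requires: pip install {pip_name}")
    # Imported lazily so the check above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, optim=optim)
```

Calling `make_training_args("adamw_bnb_8bit")` would then either return a configuration ready to pass to a Trainer or raise with the package to install. Some entries may need additional arguments beyond `optim`, which this sketch does not cover.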