Memory-Efficient Optimizers

Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
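As a rough illustration of the arithmetic behind that claim, the sketch below counts optimizer-state bytes for Adam at two state precisions and compares a full second-moment matrix with a factored one. The function names, the 7B parameter count, and the 4096 x 11008 layer shape are hypothetical, chosen only for the example. The 8-bit figure ignores the small per-block quantization constants that real implementations also store.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two state values (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value


def factored_second_moment_values(rows: int, cols: int) -> int:
    """Adafactor-style factoring keeps one row vector and one column vector
    in place of a full rows x cols second-moment matrix."""
    return rows + cols


GIB = 1024 ** 3
num_params = 7_000_000_000  # hypothetical 7B-parameter model

fp32_state = adam_state_bytes(num_params, bytes_per_value=4)
int8_state = adam_state_bytes(num_params, bytes_per_value=1)  # quantization constants ignored

print(f"Adam state, fp32:  {fp32_state / GIB:.1f} GiB")
print(f"Adam state, 8-bit: {int8_state / GIB:.1f} GiB")

rows, cols = 4096, 11008  # hypothetical weight-matrix shape
print(f"Second moment, full:     {rows * cols:,} values")
print(f"Second moment, factored: {factored_second_moment_values(rows, cols):,} values")
```

The counts cover optimizer state only; weights, gradients, and activations come on top and are not affected by the choice of state precision in this sketch.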

Optimizer	Venue	Paper	Code	`zij`
Adafactor	ICML 2018	Adafactor: Adaptive Learning Rates with Sublinear Memory Cost	official	`Adafactor`
SM3	NeurIPS 2019	Memory-Efficient Adaptive Optimization	official	`SM3`
8-bit Optimizers	ICLR 2022	8-bit Optimizers via Block-wise Quantization	official	—
tpSGD	arXiv 2022	Learning with Local Gradients at the Edge	—	—
4-bit Optimizers	NeurIPS 2023	Memory Efficient Optimizers with 4-bit States	official	—
Adalite	GitHub 2023	Adalite: a custom optimizer based on Adafactor and LAMB	official	—
AdaLomo	ACL 2024 Findings	AdaLomo: Low-memory Optimization with Adaptive Learning Rate	official	`AdaLomo`
CAME	ACL 2023	CAME: Confidence-guided Adaptive Memory Efficient Optimization	official	`CAME`
Lion	NeurIPS 2023	Symbolic Discovery of Optimization Algorithms	official	—
LOMO	ACL 2024	Full Parameter Fine-tuning for Large Language Models with Limited Resources	official	`Lomo`
MeZO	NeurIPS 2023	Fine-Tuning Language Models with Just Forward Passes	official	—
Tiger	GitHub 2023	Tiger: A Tight-fisted Optimizer	official	`Tiger`
4-bit Shampoo	NeurIPS 2024	4-bit Shampoo for Memory-Efficient Network Training	official	—
Adam-mini	ICLR 2025	Adam-mini: Use Fewer Learning Rates To Gain More	official	`AdamMini`
Adapprox	arXiv 2024	Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices	—	—
AdaRankGrad	ICLR 2025	AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning	—	—
Addax	ICLR 2025	Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models	official	—
APOLLO	MLSys 2025	APOLLO: SGD-like Memory, AdamW-level Performance	official	`APOLLO`
BAdam	NeurIPS 2024	BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models	official	`BlockOptimizer`
COAP	CVPR 2025	COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection	official	—
Fira	NeurIPS 2025	Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?	official	`FiraAdamW`
Flora	ICML 2024	Flora: Low-Rank Adapters Are Secretly Gradient Compressors	official	—
FRUGAL	ICML 2025	FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training	official	—
GaLore	ICML 2024	GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection	official	`GaLoreAdamW`
GoLore	ICML 2025	Subspace Optimization for Large Language Models with Convergence Guarantees	official	—
GRASS	EMNLP 2024	Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients	official	—
LDAdam	ICLR 2025	LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics	official	`LDAdamW`
LoQT	NeurIPS 2024	LoQT: Low-Rank Adapters for Quantized Pretraining	official	—
LoRA-RITE	ICLR 2025	LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization	official	—
MicroAdam	NeurIPS 2024	MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence	official	—
Muon	Blog 2024	Muon: An optimizer for hidden layers in neural networks	official	`Muon`
Online Subspace Descent	NeurIPS 2024	Memory-Efficient LLM Training with Online Subspace Descent	official	—
Q-GaLore	CPAL 2025	Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients	official	—
SGD-SaI	arXiv 2024	No More Adam: Learning Rate Scaling at Initialization is All You Need	official	`SGDSaI`
SMMF	AAAI 2025	SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization	official	—
SNSM	ICML 2025	Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees	official	—
SWAN	ICML 2025	SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training	—	—
AlphaGrad	arXiv 2025	AlphaGrad: Non-Linear Gradient Normalization Optimizer	—	—
GWT	arXiv 2025	GWT: Scalable Optimizer State Compression for Large Language Model Training	—	—
MLorc	AISTATS 2026	MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation	official	—
MoFaSGD	TMLR 2025	Low-rank Momentum Factorization for Memory Efficient Training	official	—
RACS / Alice	arXiv 2025	Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension	community	—
SinkGD	arXiv 2025	Gradient Multi-Normalization for Stateless and Scalable LLM Training	—	—
SPAM	ICLR 2025	SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training	official	`SPAM`
SubTrack++	NeurIPS 2025	SubTrack++: Gradient Subspace Tracking for Scalable LLM Training	official	—
SUMO	NeurIPS 2025	SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training	—	—
TensorGRaD	arXiv 2025	TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training	—	—
FlashOptim	arXiv 2026	FlashOptim: Optimizers for Memory-Efficient Training	official	—
Rose	GitHub 2026	Rose: Range-Of-Slice Equilibration optimizer	official	—
SAGE	ACL 2026 Findings	SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization	—	—
BlockLLM	arXiv 2024	BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks	official	—
Natural GaLore	arXiv 2024	Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning	official	—
SLTrain	NeurIPS 2024	SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining	official	—
8-bit Muon	arXiv 2025	Effective Quantization of Muon Optimizer States	—	—
FFT-based Subspace Selection	ICLR 2026	FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models	official	—
FOAM	arXiv 2025	FOAM: Blocked State Folding for Memory-Efficient LLM Training	official	—
GaLore 2	arXiv 2025	GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection	—	—
GradientStabilizer	ICML 2026	GradientStabilizer: Fix the Norm, Not the Gradient	official	—
GUM	arXiv 2025	Unbiased Gradient Low-Rank Projection	—	—
I3S	NeurIPS 2025	Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining	—	—
LORENZA	TMLR 2026	LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM	—	—
ProjFactor (VLoRP)	arXiv 2025	Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients	—	—
RSO	arXiv 2025	A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models	—	—
SCALE	ICML 2026	Memory-Efficient LLM Pretraining via Minimalist Optimizer Design	—	—
SlimAdam	arXiv 2025	When Can You Get Away with Low Memory Adam?	official	—
LoRA-Pre	ICLR 2026	Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation	official	—
Lotus	arXiv 2026	Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching	—	—
POET-X	ICML 2026	POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation	official	—
MuonQ	arXiv 2026	MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization	official	—
4-bit-Muon-GRASP	ICLR 2026	Achieving low-bit Muon through subspace preservation and grid quantization	official	—
IO-Adam	OpenReview 2026	IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation	—	—
H-Fac	AISTATS 2025	Memory-Efficient Optimization with Factorized Hamiltonian Descent	—	—
LiMuon	ICML 2026	LiMuon: Light and Fast Muon Optimizer for Large Models	—	—
M+Adam	OPT 2025 (NeurIPS 2025 Workshop)	M+Adam: Stable Low-Precision Training with Combined Adam–Madam Updates	—	—
SMET	ICML 2026	Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling	official	—
PowerStep	arXiv 2026	PowerStep: Memory-Efficient Adaptive Optimization via ℓp-Norm Steepest Descent	official	—
SRON	OpenReview 2025	SRON: State-free LLM Training via Row-wise Gradient Normalization	—	—
GradLite	arXiv 2025	Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints	—	—
Optimal Low-Rank SGE	arXiv 2026	Optimal low-rank stochastic gradient estimation for LLM training	—	—
Spectral Compact Training (SCT)	arXiv 2026	Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction	official	—

Trainer integrations

Hugging Face transformers exposes many of these methods through the `optim` argument of `TrainingArguments`. Each string value below selects a memory-efficient optimizer. Adafactor is built into transformers; every other entry relies on a backing library that must be installed separately.

`optim` value	Backing library (license)
`adafactor`	transformers ships its own `Adafactor` implementation with relative-step and update-clipping options (Apache-2.0).
`adamw_bnb_8bit` / `adamw_8bit`	bitsandbytes AdamW with block-wise 8-bit quantized state (MIT).
`paged_adamw_8bit` / `paged_adamw_32bit`	bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT).
`lion_8bit` / `lion_32bit` / `paged_lion_8bit` / `paged_lion_32bit`	bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT).
`ademamix_8bit` / `paged_ademamix_8bit` / `paged_ademamix_32bit`	bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT).
`rmsprop_bnb_8bit`	bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT).
`adamw_torch_4bit` / `adamw_torch_8bit`	torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause).
`galore_adamw` / `galore_adamw_8bit` / `galore_adafactor` and `*_layerwise` variants	galore-torch, the official GaLore release (Apache-2.0).
`apollo_adamw` / `apollo_adamw_layerwise`	apollo-torch, the official APOLLO release (CC-BY-NC-4.0).
`lomo` / `adalomo`	lomo-optim, the official LOMO and AdaLomo release (MIT).
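The table above can be turned into a small pre-flight check, sketched below. It maps an `optim` string to the package it needs and reports whether that package is importable. The helper names (`backend_for`, `is_installed`, `build_training_args`) are hypothetical, and the import names in `BACKEND_MODULES` are assumptions based on the usual package-to-module naming; verify them against the installed versions. Only the `optim` values listed in the table are handled.

```python
import importlib.util
from typing import Optional

# Assumed mapping from pip package name to importable module name.
BACKEND_MODULES = {
    "bitsandbytes": "bitsandbytes",
    "torchao": "torchao",
    "galore-torch": "galore_torch",
    "apollo-torch": "apollo_torch",
    "lomo-optim": "lomo_optim",
}

# `optim` values from the table that are backed by bitsandbytes.
BNB_VALUES = {
    "adamw_bnb_8bit", "adamw_8bit",
    "paged_adamw_8bit", "paged_adamw_32bit",
    "lion_8bit", "lion_32bit", "paged_lion_8bit", "paged_lion_32bit",
    "ademamix_8bit", "paged_ademamix_8bit", "paged_ademamix_32bit",
    "rmsprop_bnb_8bit",
}


def backend_for(optim: str) -> Optional[str]:
    """Return the pip package an `optim` value needs, or None if built in."""
    if optim == "adafactor":
        return None
    if optim in BNB_VALUES:
        return "bitsandbytes"
    if optim in ("adamw_torch_4bit", "adamw_torch_8bit"):
        return "torchao"
    if optim.startswith("galore_"):
        return "galore-torch"
    if optim.startswith("apollo_"):
        return "apollo-torch"
    if optim in ("lomo", "adalomo"):
        return "lomo-optim"
    raise ValueError(f"optim value not covered by this sketch: {optim!r}")


def is_installed(package: str) -> bool:
    """Check whether the module assumed for `package` can be imported."""
    return importlib.util.find_spec(BACKEND_MODULES[package]) is not None


def build_training_args(optim: str, output_dir: str = "out"):
    """Create TrainingArguments after confirming the backend is present."""
    package = backend_for(optim)
    if package is not None and not is_installed(package):
        raise RuntimeError(f"optim={optim!r} needs `pip install {package}`")
    from transformers import TrainingArguments  # imported lazily
    return TrainingArguments(output_dir=output_dir, optim=optim)
```

Calling `build_training_args("adafactor")` needs only transformers itself, while a value such as `"galore_adamw"` fails early with an install hint if the backend is missing, rather than partway through trainer setup.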