Memory-Efficient Optimizers
Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
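
To make the "two extra values per parameter" figure concrete, the following back-of-the-envelope sketch estimates optimizer-state size at different state precisions. The helper `optimizer_state_gib` and the 7-billion-parameter model are hypothetical, chosen only for illustration, and the estimate deliberately ignores quantization metadata (such as per-block scales), paging, and any low-rank projection.

```python
def optimizer_state_gib(num_params: int,
                        bytes_per_value: float,
                        values_per_param: int = 2) -> float:
    """Rough optimizer-state size in GiB.

    Assumption: state is `values_per_param` buffers of `num_params` entries
    each, with no extra metadata. Real quantized optimizers add a small
    overhead for per-block scaling factors.
    """
    return num_params * values_per_param * bytes_per_value / 2**30


n = 7_000_000_000  # hypothetical 7B-parameter model

adam_fp32 = optimizer_state_gib(n, 4)    # two fp32 moments, about 52.2 GiB
adam_8bit = optimizer_state_gib(n, 1)    # two 8-bit moments
adam_4bit = optimizer_state_gib(n, 0.5)  # two 4-bit moments
lion_8bit = optimizer_state_gib(n, 1, values_per_param=1)  # single momentum buffer

for name, size in [("Adam fp32", adam_fp32), ("Adam 8-bit", adam_8bit),
                   ("Adam 4-bit", adam_4bit), ("Lion 8-bit", lion_8bit)]:
    print(f"{name:>10}: {size:6.2f} GiB")
```

Under these assumptions, 8-bit state is a quarter of the full-precision footprint and 4-bit state an eighth, which is why state quantization alone can decide whether a model fits on a given device.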
Trainer integrations
Hugging Face transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below maps to a memory-efficient optimizer; every backing library except the built-in Adafactor must be installed separately.
| optim value | Backing library |
|---|---|
| adafactor | transformers ships its own Adafactor implementation with relative-step and update-clipping options (Apache-2.0). |
| adamw_bnb_8bit / adamw_8bit | bitsandbytes AdamW with block-wise 8-bit quantized state (MIT). |
| paged_adamw_8bit / paged_adamw_32bit | bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT). |
| lion_8bit / lion_32bit / paged_lion_8bit / paged_lion_32bit | bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT). |
| ademamix_8bit / paged_ademamix_8bit / paged_ademamix_32bit | bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT). |
| rmsprop_bnb_8bit | bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT). |
| adamw_torch_4bit / adamw_torch_8bit | torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause). |
| galore_adamw / galore_adamw_8bit / galore_adafactor and *_layerwise variants | galore-torch, the official GaLore release (Apache-2.0). |
| apollo_adamw / apollo_adamw_layerwise | apollo-torch, the official APOLLO release (CC-BY-NC-4.0). |
| lomo / adalomo | lomo-optim, the official LOMO and AdaLomo release (MIT). |
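
As a minimal sketch of how one of these values might be selected, the snippet below checks that the backing package is importable before building a TrainingArguments object. The `BACKING_MODULE` table and the helpers `backing_available` and `make_training_args` are hypothetical names introduced here, not part of transformers, and the import names (`bitsandbytes`, `torchao`, `galore_torch`, `apollo_torch`, `lomo_optim`) are assumed to match the packages listed in the table. Only a subset of the optim values is included.

```python
import importlib.util

# Assumed import name of the package behind each optim value.
# None means the optimizer ships with transformers itself.
BACKING_MODULE = {
    "adafactor": None,
    "adamw_bnb_8bit": "bitsandbytes",
    "paged_adamw_8bit": "bitsandbytes",
    "lion_8bit": "bitsandbytes",
    "adamw_torch_4bit": "torchao",
    "galore_adamw": "galore_torch",
    "apollo_adamw": "apollo_torch",
    "lomo": "lomo_optim",
}


def backing_available(optim: str) -> bool:
    """Return True if the package behind `optim` can be imported."""
    module = BACKING_MODULE[optim]
    return module is None or importlib.util.find_spec(module) is not None


def make_training_args(optim: str, output_dir: str = "out"):
    """Build TrainingArguments for `optim`, failing early if a package is missing."""
    if not backing_available(optim):
        raise ImportError(
            f"optim={optim!r} needs the {BACKING_MODULE[optim]!r} package installed"
        )
    # Imported lazily so the availability check above works without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, optim=optim)


if __name__ == "__main__":
    for name in BACKING_MODULE:
        status = "available" if backing_available(name) else "missing backing package"
        print(f"{name:>18}: {status}")
```

Checking up front is a design convenience rather than a requirement: it surfaces a missing dependency when the configuration is built instead of when the optimizer is first created. Some values, such as the GaLore family, may need further arguments that this sketch does not set.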