AdamMini
Implements Adam-mini, a memory-efficient Adam variant that assigns a single second-moment value, and hence a single learning rate, to each parameter block.
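As a sketch following the cited paper, with notation that is assumed here rather than taken from this page (\(g_{b,t}\) is the gradient of block \(b\) at step \(t\), \(m_{b,t}\) its coordinate-wise first moment, \(v_{b,t}\) its scalar second moment, \(\eta\) the learning rate, \(\lambda\) the weight-decay coefficient), the update for each block can be written as

\[
\begin{aligned}
m_{b,t} &= \beta_1\, m_{b,t-1} + (1-\beta_1)\, g_{b,t} \\
v_{b,t} &= \beta_2\, v_{b,t-1} + (1-\beta_2)\, \operatorname{mean}\!\left(g_{b,t} \odot g_{b,t}\right) \\
\hat{m}_{b,t} &= \frac{m_{b,t}}{1-\beta_1^{t}}, \qquad \hat{v}_{b,t} = \frac{v_{b,t}}{1-\beta_2^{t}} \\
\theta_{b,t} &= \theta_{b,t-1} - \eta\lambda\, \theta_{b,t-1} - \eta\, \frac{\hat{m}_{b,t}}{\sqrt{\hat{v}_{b,t}} + \epsilon}
\end{aligned}
\]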
The blocks \(b\) follow the model architecture: embedding and output layers keep Adam's coordinate-wise second moment; query and key projections use one block per attention head; fused QKV weights use one block per head and query group; and every remaining parameter tensor forms a single block. Weight decay is decoupled, as in AdamW, and is disabled for normalization layers.
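To make the block-wise second moment concrete, the following dependency-free sketch performs one step on a single block stored as a flat list of floats. The function names, the default hyperparameters, and the assumption that heads occupy contiguous equal-sized slices are illustrative only; this is not the class's actual implementation.

```python
import math


def adam_mini_block_step(theta, grad, m, v, step, lr=1e-3, beta1=0.9,
                         beta2=0.999, eps=1e-8, weight_decay=0.0):
    """Update one block; `v` is a single scalar shared by the whole block.

    `theta`, `grad` and `m` are equal-length lists of floats and `step`
    counts from 1. Returns the new (theta, m, v).
    """
    # Second moment: EMA of the block's mean squared gradient (one scalar).
    mean_sq = sum(g * g for g in grad) / len(grad)
    v = beta2 * v + (1.0 - beta2) * mean_sq
    # First moment stays coordinate-wise, as in Adam.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    # Bias corrections for both moments.
    bc1 = 1.0 - beta1 ** step
    bc2 = 1.0 - beta2 ** step
    denom = math.sqrt(v / bc2) + eps
    # Decoupled weight decay, then a step scaled by the shared denominator.
    theta = [t * (1.0 - lr * weight_decay) - lr * (mi / bc1) / denom
             for t, mi in zip(theta, m)]
    return theta, m, v


def split_by_head(values, num_heads):
    """Split a flat tensor into contiguous per-head blocks (assumed layout)."""
    size = len(values) // num_heads
    return [values[i * size:(i + 1) * size] for i in range(num_heads)]
```

On the first step with gradient `[3.0, 4.0]`, the two coordinates move in the ratio 3:4 because they share one denominator, whereas coordinate-wise Adam would move both by roughly `lr`. For a query or key projection, the same step would be applied to each slice returned by `split_by_head`, with one `v` per head.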
Note: The constructor takes the model itself rather than a parameter iterable, since the block partition is derived from parameter names. A plain iterable of tensors is also accepted; its entries are treated as unnamed, one block per tensor.
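A name-based partition of this kind might look like the sketch below. The helper names `block_rule` and `uses_weight_decay`, and every substring they match, are hypothetical conventions chosen for illustration; the names the optimizer actually recognizes are not listed here.

```python
def block_rule(name=None):
    """Pick a partition rule from a parameter name (hypothetical matching).

    Unnamed entries, as from a plain iterable of tensors, fall back to one
    block per tensor.
    """
    if name is None:
        return "whole_tensor"
    lowered = name.lower()
    # Embedding and output layers keep a coordinate-wise second moment.
    if "embed" in lowered or "lm_head" in lowered:
        return "per_coordinate"
    # Fused QKV weights are checked before the separate query/key names.
    if "qkv" in lowered or "query_key_value" in lowered:
        return "per_head_and_group"
    # Separate query and key projections: one block per attention head.
    if any(key in lowered for key in ("q_proj", "k_proj", "query", "key")):
        return "per_head"
    # Everything else: the whole tensor is a single block.
    return "whole_tensor"


def uses_weight_decay(name=None):
    """Normalization layers (assumed to contain 'norm') skip weight decay."""
    return name is None or "norm" not in name.lower()
```

Without names, none of the architecture-specific rules can fire, which is why a plain iterable degrades to one block per tensor.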
Reference: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun, "Adam-mini: Use Fewer Learning Rates To Gain More", ICLR 2025. https://arxiv.org/abs/2406.16793