Addax
Implements Addax, a memory-efficient fine-tuning method that mixes zeroth-order and first-order gradient estimates.
Addax computes a true first-order gradient on one minibatch and a zeroth-order estimate on another. The first-order gradient is applied with in-place SGD: each layer's gradient is consumed and discarded right after it is produced, so the full gradient is never materialized. The zeroth-order estimate uses the SPSA finite-difference rule along a random direction \(z\), which needs only forward passes. The two estimates are blended by a single coefficient \(\alpha\): the zeroth-order term cuts memory, while the first-order term recovers the convergence speed and accuracy that pure zeroth-order methods such as MeZO lack.
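The update can be written out as follows. This is a reconstruction from the symbol definitions, not a quotation: it assumes the standard two-sided SPSA difference and a convex combination with weight \(\alpha\) on the zeroth-order term, and it writes \(\mathcal{L}(\theta; \mathcal{B})\) for the mean loss on minibatch \(\mathcal{B}\) (notation introduced here for illustration):

\[
g^{0}_t = \frac{\mathcal{L}(\theta_t + \epsilon z;\, \mathcal{B}^{0}) - \mathcal{L}(\theta_t - \epsilon z;\, \mathcal{B}^{0})}{2\epsilon},
\qquad
g^{1}_t = \nabla_{\theta}\, \mathcal{L}(\theta_t;\, \mathcal{B}^{1}),
\]

\[
\theta_{t+1} = \theta_t - \eta \left( \alpha\, g^{0}_t\, z + (1 - \alpha)\, g^{1}_t \right),
\]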
where \(\theta\) are the parameters, \(\eta\) the learning rate, \(g^{0}_t\) the scalar zeroth-order directional derivative along the random direction \(z\), \(g^{1}_t\) the first-order gradient, \(\epsilon\) the perturbation scale, \(\mathcal{B}^{0}\) and \(\mathcal{B}^{1}\) the zeroth-order and first-order minibatches, and \(\alpha \in [0,1]\) the coefficient balancing the two estimates.
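To make the roles of these symbols concrete, here is a minimal pure-Python sketch of one blended step on a toy linear-regression loss. It is an illustration under stated assumptions, not this library's implementation: the names `loss`, `grad`, and `addax_step` are hypothetical, the blend is assumed to be \(\alpha\, g^{0} z + (1-\alpha)\, g^{1}\), and the first-order gradient is materialized in full for readability instead of being applied layer by layer as in-place SGD would do.

```python
import random


def loss(theta, batch):
    """Mean squared error of a linear model y ~ theta . x over a minibatch."""
    total = 0.0
    for x, y in batch:
        pred = sum(t * xi for t, xi in zip(theta, x))
        total += (pred - y) ** 2
    return total / len(batch)


def grad(theta, batch):
    """Analytic gradient of `loss` with respect to theta (first-order estimate)."""
    g = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(batch)
    return g


def addax_step(theta, batch0, batch1, lr, alpha, eps, rng):
    """One blended update (assumed form): theta - lr * (alpha*g0*z + (1-alpha)*g1)."""
    # Random direction z ~ N(0, I), shared by both perturbed evaluations.
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Scalar SPSA directional derivative on the zeroth-order minibatch.
    g0 = (loss(plus, batch0) - loss(minus, batch0)) / (2.0 * eps)
    # First-order gradient on the other minibatch.
    g1 = grad(theta, batch1)
    return [
        t - lr * (alpha * g0 * zi + (1.0 - alpha) * g1i)
        for t, zi, g1i in zip(theta, z, g1)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    true_theta = [1.5, -0.5]
    data = []
    for _ in range(64):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        data.append((x, sum(t * xi for t, xi in zip(true_theta, x))))

    theta = [0.0, 0.0]
    for _ in range(300):
        theta = addax_step(theta, rng.sample(data, 8), rng.sample(data, 8),
                           lr=0.05, alpha=0.5, eps=1e-3, rng=rng)
    print("final loss:", loss(theta, data))
```

In this sketch, `alpha=0.0` reduces the step to plain minibatch SGD and `alpha=1.0` to a pure SPSA (MeZO-style) step, which is a convenient way to sanity-check the blend.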
Reference: Zeman Li, Xinwei Zhang, Peilin Zhong, Yuan Deng, Meisam Razaviyayn, Vahab Mirrokni, "Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models", ICLR 2025. https://arxiv.org/abs/2410.06441