LeZO
Implements LeZO, a layer-wise sparse zeroth-order optimizer for memory- and compute-efficient fine-tuning of large language models.
LeZO builds on the MeZO-style SPSA estimator, which approximates the gradient from two forward passes along a single random direction \(z\), removing the need to store activations or backpropagate. To cut the perturbation and update cost, LeZO treats whole layers as the unit of sparsity: at each step it randomly keeps a fraction of the layers and zeros the perturbation on the rest, so only the retained parameters are estimated and updated. Because a different layer subset is drawn each step (seeded by \(s_t\)), full-parameter coverage is still achieved over the course of training.
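Written out, the update plausibly takes the standard two-point SPSA form shown here. This is a reconstruction from the symbols this section defines, not a quotation from the paper; the loss symbol \(\mathcal{L}\) and the shorthand \(z'_t\) for the masked direction are notational assumptions introduced for readability.

\[
z'_t = \mathcal{R}(z_t, \rho, s_t), \qquad
\hat{g}_t = \frac{\mathcal{L}(\theta_t + \epsilon z'_t;\, \mathcal{B}_t) - \mathcal{L}(\theta_t - \epsilon z'_t;\, \mathcal{B}_t)}{2\epsilon}\, z'_t, \qquad
\theta_{t+1} = \theta_t - \eta\, \hat{g}_t
\]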
In the LeZO update, \(\theta\) denotes the parameters, \(\eta\) the learning rate, \(\mathcal{B}_t\) the minibatch, \(\epsilon\) the perturbation scale, and \(z_t\) a standard Gaussian direction. The masking operator \(\mathcal{R}(z_t, \rho, s_t)\) keeps a \((1-\rho)\) fraction of the layers, chosen at random using seed \(s_t\), and zeros the perturbation on the remaining layers, so the gradient estimate \(\hat{g}_t\) is nonzero only on the retained parameters.
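One step of this scheme can be sketched in plain Python on a toy model. This is a minimal illustration under stated assumptions, not this library's implementation: the names `select_layers`, `perturb`, `lezo_step`, and `sum_of_squares`, the list-of-lists parameter layout, the per-layer seeding scheme, and the choice to always keep at least one layer are all hypothetical.

```python
import random


def select_layers(num_layers, rho, seed):
    """Return the sorted indices of the layers kept this step.

    Roughly a (1 - rho) fraction is kept. Keeping at least one layer
    is an assumption of this sketch.
    """
    rng = random.Random(seed)
    num_keep = max(1, round((1.0 - rho) * num_layers))
    return sorted(rng.sample(range(num_layers), num_keep))


def perturb(params, active, seed, scale):
    """Add scale * z in place on the active layers only.

    z is regenerated from the seed on every call instead of being stored.
    The per-layer seed formula is a hypothetical choice for this sketch.
    """
    for idx in active:
        rng = random.Random(seed * 1_000_003 + idx)
        layer = params[idx]
        for j in range(len(layer)):
            layer[j] += scale * rng.gauss(0.0, 1.0)


def lezo_step(params, loss_fn, lr, eps, rho, seed):
    """Run one layer-sparse SPSA update in place.

    Returns the active layer indices and the scalar projected gradient.
    """
    active = select_layers(len(params), rho, seed)
    perturb(params, active, seed, +eps)          # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(params, active, seed, -2.0 * eps)    # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(params, active, seed, +eps)          # back to theta
    proj_grad = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(params, active, seed, -lr * proj_grad)  # theta -= lr * g_hat
    return active, proj_grad


def sum_of_squares(params):
    """Toy loss standing in for a minibatch loss."""
    return sum(x * x for layer in params for x in layer)


# Toy usage: four "layers" of different sizes, half of them dropped per step.
toy_params = [[1.0, -2.0], [0.5, 0.5, 0.5], [3.0], [-1.0, 1.0]]
for step in range(200):
    lezo_step(toy_params, sum_of_squares, lr=0.02, eps=1e-3, rho=0.5, seed=step)
```

Layers outside `active` are never touched, so both the perturbation and the update cost scale with the retained fraction. Regenerating the direction from the seed, instead of storing it, mirrors the way MeZO-style optimizers avoid holding a second copy of the parameters in memory.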
Reference: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding, "Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models", arXiv 2024. https://arxiv.org/abs/2410.09823