AGZO¶
Implements AGZO, an activation-guided zeroth-order optimizer that confines perturbations to a low-rank activation subspace.
AGZO fine-tunes language models with forward passes only, in the spirit of MeZO, but replaces the isotropic Gaussian perturbation with a structured one. For each linear layer it first extracts an orthonormal basis \(A_\ell\) spanning the dominant directions of the layer's activation matrix \(H_\ell\) via a few steps of power iteration, then samples the perturbation inside that subspace as \(\Delta_\ell = R_\ell A_\ell^\top\). Nonlinear layers fall back to a plain Gaussian perturbation. Steering the probe directions toward where activations actually have energy reduces the variance of the gradient estimate.
The gradient is the one-sided forward-difference quotient: perturb in place by \(\mu \Delta_\ell\), evaluate the perturbed loss \(f_+\), and reconstruct a rank-structured gradient estimate by scaling \(\Delta_\ell\) with the scalar \(g = (f_+ - f_0)/\mu\). The descent step both undoes the perturbation and applies the update, so only the single scalar \(g\) and the random seeds need to be carried between evaluations.
where \(W_\ell\) are the weights of layer \(\ell\), \(f(W; B)\) is the loss on minibatch \(B\), \(A_\ell\) is the orthonormal activation-informed basis from power iteration on \(H_\ell H_\ell^\top\), \(R_\ell\) and \(u_\ell\) are i.i.d. standard Gaussian, \(\mu > 0\) is the perturbation scale, \(\eta\) is the learning rate, \(f_0\) and \(f_+\) are the unperturbed and perturbed losses, the term \(-\mu \Delta_\ell\) restores the parameters and \(-\eta\, g\, \Delta_\ell\) applies the step.
Reference: Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, Hong Xu, "AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning", ICML 2026. https://arxiv.org/abs/2601.17261