FZOO¶
Implements FZOO, a fast zeroth-order optimizer that estimates gradients from forward passes alone and adapts its step size by the standard deviation of the perturbed batch losses.
FZOO replaces backpropagation with a batched one-sided difference estimate. At each step it draws \(N\) Rademacher random vectors \(u_i \in \{\pm 1\}^d\), perturbs the parameters by a radius \(\epsilon\), and combines the loss differences \(l_i - l_0\) into a single gradient estimate. Dividing the update by the standard deviation \(\sigma_t\) of the batch losses yields larger steps in flat regions and smaller steps in steep ones, mirroring Adam-style adaptivity while keeping memory at the inference level. The paper proves this update is formally equivalent to a zeroth-order extension of normalized-SGD.
where \(\theta_t\) are the parameters, \(\eta_t\) the step size, \(\epsilon\) the perturbation radius, \(u_i\) the \(N\) i.i.d. Rademacher perturbation vectors, \(L(\cdot; B_t)\) the loss on mini-batch \(B_t\), \(g_t\) the one-sided gradient estimate, and \(\sigma_t\) the standard deviation of the \(N\) perturbed losses.
Reference: Sizhe Dang, Yangyang Guo, Yanjun Zhao, Haishan Ye, Xiaodong Zheng, Guang Dai, Ivor Tsang, "FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed", arXiv 2025. https://arxiv.org/abs/2506.09034