AdaMeZO
Implements AdaMeZO, an Adam-style zeroth-order optimizer that recovers first- and second-moment estimates without storing them in memory.
AdaMeZO fine-tunes large models using only forward passes. As in MeZO, the gradient is replaced by the SPSA estimate along a single random Gaussian direction \(z_t\), giving a rank-1 reconstruction that costs two forward passes per step. To gain Adam's curvature awareness without tripling memory, AdaMeZO does not keep the moments \(m_t,v_t\) in memory. Instead it caches only the per-step random seeds (or PRNG states) and the scalar projected gradients, then unrolls the exponential moving averages into a truncated sum over a finite horizon \(h\): gradients older than \(h\) steps are discarded, and the surviving terms are regenerated on the fly by replaying the cached random streams. The truncated moments are reconstructed block-wise so the model can be updated in place.
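The update described above can be written out as follows. This is a sketch reconstructed from the symbol definitions on this page, assuming the standard central-difference SPSA estimate and an Adam-style step without bias correction; the exact form in the reference may differ. Terms with \(t-i\) before the first step are taken to be zero.

\[
\begin{aligned}
g_t &= \frac{\mathcal{L}(\theta_t + \mu z_t; B_t) - \mathcal{L}(\theta_t - \mu z_t; B_t)}{2\mu}\, z_t, \qquad z_t \sim \mathcal{N}(0, I_d), \\
m_t &\approx (1-\beta_1) \sum_{i=0}^{h-1} \beta_1^{i}\, g_{t-i}, \\
v_t &\approx (1-\beta_2) \sum_{i=0}^{h-1} \beta_2^{i}\, g_{t-i} \odot g_{t-i}, \\
\theta_{t+1} &= \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}.
\end{aligned}
\]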
In the update rule, \(\theta \in \mathbb{R}^d\) are the parameters, \(\mathcal{L}(\cdot; B_t)\) is the loss on minibatch \(B_t\), \(z_t\) is a perturbation with i.i.d. standard Gaussian entries, \(\mu\) is the perturbation scale, \(\eta\) is the learning rate, \(g_t\) is the SPSA gradient estimate (the scalar finite-difference quotient times \(z_t\)), \(\odot\) is the elementwise product, \(\beta_1,\beta_2 \in (0,1)\) are the moment decay rates, \(h\) is the finite moment horizon beyond which old gradients are truncated, and \(\epsilon\) is a small stability constant. The moments are never stored; they are recomputed each step by replaying cached seeds and projected gradients.
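The seed-replay idea can be illustrated with a small NumPy sketch. It is not the reference implementation: the names `perturb`, `adamezo_step`, and `quad_loss` are hypothetical, the parameter "blocks" are simply the entries of a Python list, bias correction is omitted, and the toy quadratic loss stands in for a model's forward pass. The only state carried between steps is a list of `(seed, projected_gradient)` pairs of length at most `h`.

```python
import numpy as np


def perturb(params, seed, scale):
    """Add scale * z to every block in place, regenerating z from `seed`."""
    rng = np.random.default_rng(seed)
    for p in params:
        p += scale * rng.standard_normal(p.shape)


def adamezo_step(params, loss_fn, seed, history, lr=1e-2, mu=1e-3,
                 beta1=0.9, beta2=0.99, h=8, eps=1e-8):
    """One in-place update. `history` holds (seed, projected_grad), newest last."""
    # Two forward passes along the direction generated by `seed`.
    perturb(params, seed, +mu)
    loss_plus = loss_fn(params)
    perturb(params, seed, -2 * mu)
    loss_minus = loss_fn(params)
    perturb(params, seed, +mu)  # restore (up to floating-point rounding)
    proj_grad = (loss_plus - loss_minus) / (2 * mu)  # scalar projected gradient

    # Cache only the seed and the scalar; drop entries older than h steps.
    history.append((seed, proj_grad))
    del history[:-h]

    # One generator per cached step, newest first. They are advanced in
    # lockstep, block by block, so each block sees the same slice of z
    # that `perturb` used for it.
    replay = [(np.random.default_rng(s), c) for s, c in reversed(history)]
    for p in params:
        m = np.zeros_like(p)  # temporary, block-sized buffers only
        v = np.zeros_like(p)
        for age, (rng, c) in enumerate(replay):
            g = c * rng.standard_normal(p.shape)
            m += (1 - beta1) * beta1 ** age * g
            v += (1 - beta2) * beta2 ** age * g * g
        p -= lr * m / (np.sqrt(v) + eps)
    return proj_grad


def quad_loss(params):
    """Toy loss standing in for a forward pass: 0.5 * ||theta||^2."""
    return 0.5 * sum(float(np.sum(p * p)) for p in params)


params = [np.full(2, 3.0), np.full((2, 2), -3.0)]
history = []
initial = quad_loss(params)
for step in range(300):
    adamezo_step(params, quad_loss, seed=step, history=history, lr=0.05)
final = quad_loss(params)
print(f"loss: {initial:.3f} -> {final:.3f}, cached steps: {len(history)}")
```

In this sketch the extra memory beyond the parameters is two block-sized buffers plus `h` scalars and seeds, at the cost of regenerating `h` random streams per step. Larger `h` tracks the full exponential moving averages more closely but makes each update proportionally more expensive.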
Reference: Zhijie Cai, Haolong Chen, Guangxu Zhu, "AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments", arXiv 2026. https://arxiv.org/abs/2605.00650