DiZO¶
Implements DiZO, a divergence-driven zeroth-order fine-tuning method that rescales the cumulative update layer by layer.
DiZO targets the slow convergence of memory-efficient zeroth-order (ZO) fine-tuning, where the gradient is estimated by finite differences of forward passes and every layer receives a uniform-magnitude step. Its core idea is divergence-driven layer adaptation: after running ordinary ZO-SGD steps, the displacement of each layer from the pretrained weights is rescaled by a learned per-layer projection so that updates carry diverse magnitudes matched to each layer's needs, rather than the single global scale that standard ZO imposes.
The base optimization uses the SPSA-style two-sided estimator and a plain SGD step. Periodically, the cumulative displacement \(\Delta\theta_t = \theta_t - \theta_0\) from the pretrained model \(\theta_0\) is reshaped per layer \(\ell\) by a projection scalar \(\gamma_t^{(\ell)}\), which is itself optimized with ZO:
where \(\theta\) are the parameters, \(\theta_0\) the pretrained weights, \(\eta\) the learning rate, \(\epsilon\) the perturbation scale, \(u_i\) random Gaussian directions, \(q\) the number of queries, \(\ell\) the layer index, \(\Delta\theta_t^{(\ell)}=\theta_t^{(\ell)}-\theta_0^{(\ell)}\) the per-layer displacement, and \(\gamma_t^{(\ell)}\) the per-layer projection scalar (re-initialized to \(\lVert\Delta\theta_t^{(\ell)}\rVert\) so the initial ratio is \(1\), then optimized by the same ZO estimator and SGD step, with the ratio clipped to \([1-\tau,1+\tau]\)).
Reference: Qitao Tan, Jun Liu, Zheng Zhan, Caiwen Ding, Yanzhi Wang, Jin Lu, Geng Yuan, "Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning", ICML 2025. https://arxiv.org/abs/2502.03304