QZO¶
Implements QZO, zeroth-order fine-tuning of a quantized network through its continuous quantization scales.
The quantized weights \(\bar{\theta}\) stay frozen at their discrete values, so no high-precision weight copy or backpropagation through the quantizer is needed. Instead, QZO treats the per-group quantization scales \(\Delta\) as the only trainable parameters and estimates their gradient with a two-point SPSA scheme (Q-SPSA): the scales are perturbed symmetrically by \(\pm \epsilon z\) along a random Gaussian direction, and the resulting change in loss yields a scalar directional derivative \(d\).
To stabilize updates when the loss landscape over scales is sharp, the directional derivative is clipped to \([-C, C]\) before forming the projected gradient \(d\, z\). A ZO-SGD step then descends along this estimate, and a \(\max(\cdot, 0)\) projection keeps the scales non-negative.
where \(\bar{\theta}\) are the frozen quantized weights, \(\Delta\) are the trainable quantization scales, \(z \sim \mathcal{N}(0, I)\) is the random perturbation direction, \(\epsilon\) is the perturbation scalar, \(\odot\) is the dequantization (scale-times-weight) product, \(\mathcal{B}\) is the sampled mini-batch, \(C\) is the directional-derivative clipping threshold, and \(\gamma\) is the learning rate. The projected gradient estimate is \(\hat{\nabla}_{\Delta}\mathcal{L} = d\, z\).
Reference: Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou, "Fine-tuning Quantized Neural Networks with Zeroth-order Optimization", arXiv 2025. https://arxiv.org/abs/2505.13430