TeZO¶
Implements TeZO (Tensorized Zeroth-Order), a zeroth-order fine-tuning method that factorizes the perturbation tensor across both the model and the temporal dimension.
Memory-efficient zeroth-order (ZO) fine-tuning estimates gradients from forward passes by perturbing each weight matrix with random noise, but storing or regenerating a fresh full-size direction at every step is wasteful. TeZO views the stacked perturbations over all \(T\) iterations as a single \(m\times n\times T\) tensor and approximates it with a rank-\(r\) canonical polyadic (CP) decomposition. For each weight matrix \(\theta\) of size \(m\times n\), the spatial factors \(U\in\mathbb{R}^{m\times r}\) and \(V\in\mathbb{R}^{n\times r}\) are sampled once and held fixed, while only a small temporal vector \(\tau_t\in\mathbb{R}^{r}\) is resampled each step. The per-step perturbation is then \(Z_t = U\,\mathrm{diag}(\tau_t)\,V^\top\), and the gradient is estimated by symmetric finite differences along \(Z_t\).
This collapses the cost of generating perturbations from \(\mathcal{O}(d\,T)\) to \(\mathcal{O}(d+T)\) random numbers per matrix, since the heavy \(m\times r\) and \(n\times r\) factors are reused across all iterations. The momentum and Adam variants (TeZO-m, TeZO-Adam) accumulate statistics only on the cheap temporal factors.
where \(\theta\) is a weight matrix of size \(m\times n\), \(\eta\) is the learning rate, \(\rho\) is the perturbation (smoothing) radius, \(r\ll\min\{m,n\}\) is the CP rank, \(U,V\) are the fixed spatial factors with columns \(u_s,v_s\), \(\tau_t\) is the per-step temporal factor, \(Z_t\) is the rank-\(r\) perturbation direction, \(\kappa_t\) is the scalar finite-difference coefficient from two forward passes, and \(f\) is the loss on the current minibatch.
Reference: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao, "TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs", arXiv 2025. https://arxiv.org/abs/2501.19057