HELENE
Implements HELENE, a zeroth-order fine-tuning method combining a diagonal Hessian preconditioner with layer-wise clipping and gradient annealing.
HELENE estimates the gradient with simultaneous perturbation stochastic approximation (SPSA), avoiding any backward pass. It scales each coordinate by an EMA of a diagonal Hessian estimate (an asymptotic Gauss-Newton-Bartlett estimator refreshed every \(k\) steps), clipped per layer to bound the conditioning, and anneals the gradient contribution to the momentum over training to stabilize early steps.
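The SPSA step described above can be illustrated with a minimal NumPy sketch. It uses the standard two-point SPSA formula, which needs only two forward evaluations of the loss. The function name `spsa_gradient` and its signature are hypothetical and chosen for illustration; they are not part of this library's API.

```python
import numpy as np

def spsa_gradient(loss_fn, theta, eps, rng):
    """Two-point SPSA estimate of the gradient of loss_fn at theta.

    Hypothetical helper for illustration; returns the estimate and the
    perturbation that produced it.
    """
    z = rng.standard_normal(theta.shape)   # standard Gaussian perturbation
    loss_plus = loss_fn(theta + eps * z)   # forward pass at theta + eps * z
    loss_minus = loss_fn(theta - eps * z)  # forward pass at theta - eps * z
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_plus - loss_minus) / (2.0 * eps)
    return scale * z, z

# Usage on a toy linear loss, where the estimate equals (a . z) * z
# up to floating-point rounding.
a = np.array([1.0, -2.0, 0.5])
g, z = spsa_gradient(lambda th: float(a @ th), np.zeros(3), 1e-3,
                     np.random.default_rng(0))
```

Because the estimate is a scalar multiple of `z`, an implementation can regenerate `z` from a stored random seed instead of keeping it in memory, which is a common choice in zeroth-order fine-tuning.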
The update uses the following notation: \(\theta\) denotes the parameters, \(\eta_t\) the learning rate, \(g_t\) the SPSA gradient estimate, \(\epsilon\) the perturbation scale (also used as the stability constant in the denominator), \(z\) a standard Gaussian perturbation, and \(\mathcal{B}\) a minibatch. \(\alpha_t\) is the annealing coefficient with horizon \(T\), and \(m_t\) is the annealed gradient EMA. \(\hat{h}_t\) is the asymptotic Gauss-Newton-Bartlett diagonal Hessian estimate over a batch of size \(B\) (refreshed every \(k\) steps), and \(h_t\) is its EMA. \(\beta_1,\beta_2\) are the decay rates, \(\gamma\) is a scaling coefficient, and \(\lambda_i\) is the per-layer clipping threshold applied to layer \(i\).
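To show how these quantities could fit together, here is a rough single-layer sketch of one update. The exact formulas are not reproduced in this section, so the form below is an assumption rather than the method as published: it assumes \(\alpha_t\) multiplies the gradient inside the momentum EMA, and that layer-wise clipping floors the Hessian EMA at \(\lambda_i\) before scaling by \(\gamma\). The function name `helene_like_step` is hypothetical. Consult the referenced paper for the authoritative update rule.

```python
import numpy as np

def helene_like_step(theta, m, h, g, alpha_t, lr, beta1, beta2,
                     gamma, lam, eps, h_hat=None):
    """One assumed update for a single layer; returns (theta, m, h).

    Hypothetical sketch: the clipping and annealing forms are assumptions.
    """
    # Annealed gradient EMA: alpha_t scales how much of g enters the momentum.
    m = beta1 * m + (1.0 - beta1) * alpha_t * g
    # The Hessian EMA changes only on refresh steps (h_hat supplied every
    # k steps); on other steps the previous h is reused.
    if h_hat is not None:
        h = beta2 * h + (1.0 - beta2) * h_hat
    # Assumed layer-wise clipping: floor the curvature at lam so that
    # near-zero entries cannot blow up the step; eps adds stability.
    denom = gamma * np.maximum(h, lam) + eps
    theta = theta - lr * m / denom
    return theta, m, h

# Usage: one step on a two-parameter layer with a fresh Hessian estimate.
theta, m, h = helene_like_step(
    theta=np.ones(2), m=np.zeros(2), h=np.zeros(2),
    g=np.array([1.0, -1.0]), alpha_t=1.0, lr=0.1,
    beta1=0.9, beta2=0.99, gamma=1.0, lam=1.0, eps=0.0,
    h_hat=np.array([4.0, 0.0]),
)
```

Passing `h_hat=None` models the steps between refreshes, where only the momentum and the parameters change.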
Reference: Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu, "HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-Tuning LLM with Zeroth-Order Optimization", arXiv 2024. https://arxiv.org/abs/2411.10696