Adahessian
Implements AdaHessian, an adaptive second-order optimizer.
AdaHessian replaces the squared-gradient denominator of Adam with a running average of the squared diagonal of the Hessian, estimated with a Hutchinson matrix-free probe. For each step a Rademacher vector \(z\) (entries \(\pm 1\)) is drawn and the Hessian-vector product \(H_t z\) is formed by differentiating \(g_t^\top z\). The per-element magnitude \(|H_t z|\) is then block-averaged to reduce its variance, giving the block-averaged diagonal estimate \(D_t^{(s)}\). With first moment \(m_t\), second moment \(v_t\) over \(D_t^{(s)}\), learning rate \(\eta\), decay rates \(\beta_1\), \(\beta_2\), and Hessian power \(k\):
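The update can be sketched as follows in the notation of the referenced paper, writing \(\theta_t\) for the parameters and \(g_t\) for the gradient. The bias-corrected moments \(\hat m_t\), \(\hat v_t\), the stabilizer \(\epsilon\), and the decoupled form of the weight decay term are assumptions of this sketch and may differ from the implementation:

\[
\begin{aligned}
D_t^{(s)}[i] &= \frac{1}{b} \sum_{j \in \mathrm{block}(i)} \bigl| (H_t z)[j] \bigr| \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, \bigl(D_t^{(s)}\bigr)^2 \\
\hat m_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \eta \left( \frac{\hat m_t}{\bigl(\sqrt{\hat v_t}\bigr)^{k} + \epsilon} + \lambda\, \theta_t \right)
\end{aligned}
\]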
where \(\lambda\) is the weight_decay and \(b\) is the number of
elements in each structured block. The per-element magnitude
\(|H_t z|\) is averaged (not the signed product) over each block of
size \(b\): for example, a 2D convolution kernel is averaged over its
spatial extent, matching the block-diagonal averaging of the paper.
Setting \(k = 1\) applies the full Hessian-based preconditioner;
\(k = 0.5\) is a milder one.
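The probing and block-averaging steps described above can be illustrated with a small standalone sketch. It uses plain Python and an explicit toy matrix in place of the autograd Hessian-vector product; the helper names (rademacher, matvec, block_average_abs) and the choice of consecutive elements as a block are hypothetical and only meant to mirror the description, not the optimizer's internals.

```python
import random


def rademacher(n, rng):
    """Draw a probe vector with independent entries in {-1, +1}."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def matvec(H, z):
    """Dense product H z; a stand-in for the autograd Hessian-vector product."""
    return [sum(h_ij * z_j for h_ij, z_j in zip(row, z)) for row in H]


def block_average_abs(hz, b):
    """Average |H z| over consecutive blocks of b elements and broadcast back.

    Assumption: a block is b consecutive entries, as in a flattened
    convolution kernel whose spatial extent has b elements.
    """
    if len(hz) % b != 0:
        raise ValueError("vector length must be a multiple of the block size")
    out = []
    for start in range(0, len(hz), b):
        mean = sum(abs(x) for x in hz[start:start + b]) / b
        out.extend([mean] * b)
    return out


# Toy diagonal Hessian: |H z| equals |diag(H)| for any Rademacher z,
# so the result below does not depend on the random draw.
H = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
rng = random.Random(0)
z = rademacher(4, rng)
hz = matvec(H, z)
d = block_average_abs(hz, b=2)
print(d)  # [3.0, 3.0, 2.0, 2.0]
```

For a Hessian with off-diagonal entries, a single probe is only a noisy estimate of the diagonal, which is why the magnitude is averaged within each block and then smoothed again by the second moment.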
Note: AdaHessian needs the Hessian-vector product, so the gradients passed
to step must carry an autograd graph. Call
loss.backward(create_graph=True) before step (or pass a closure that does
so). Without create_graph=True the gradients have no grad_fn and step
raises an error. Sparse gradients are not supported.
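The effect of create_graph=True can be checked directly with PyTorch's autograd, independently of the optimizer. The following is a minimal sketch on a toy quadratic loss (the matrix A and the variable names are illustrative assumptions): it shows that the gradient carries a grad_fn only when the graph is kept, and that the Hessian-vector product is then obtained by differentiating the gradient against a Rademacher probe.

```python
import torch

torch.manual_seed(0)

# Toy quadratic loss 0.5 * w^T A w with symmetric A, so the Hessian is exactly A.
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
w = torch.tensor([1.0, -1.0], requires_grad=True)
loss = 0.5 * w @ A @ w

# create_graph=True records the backward pass, so w.grad is itself differentiable.
loss.backward(create_graph=True)
has_graph = w.grad.grad_fn is not None  # True

# Rademacher probe with entries in {-1, +1}.
z = torch.randint(0, 2, w.shape).to(w.dtype) * 2 - 1

# Hessian-vector product H z: differentiate the gradient, weighted by z.
(hz,) = torch.autograd.grad(w.grad, w, grad_outputs=z)

# For comparison, a plain backward() leaves a gradient without a graph.
w_plain = w.detach().clone().requires_grad_(True)
(0.5 * w_plain @ A @ w_plain).backward()
plain_has_graph = w_plain.grad.grad_fn is not None  # False
```

In a training loop the only change from a first-order optimizer is the create_graph=True argument to backward. Keeping the graph costs extra memory, since the intermediate tensors of the backward pass are retained until the Hessian-vector product has been computed.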
Reference: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney, "ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning", AAAI 2021. https://arxiv.org/abs/2006.00719