SophiaG
Implements Sophia (Gauss-Newton-Bartlett variant), a second-order clipped stochastic optimizer.
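Sophia preconditions the gradient with a moving average of a lightweight diagonal Hessian estimate and clips the result element-wise, which bounds the worst-case update along any coordinate. With parameters \(\theta_t\), gradient \(g_t\), first moment \(m_t\), diagonal Hessian estimate \(h_t\), learning rate \(\eta\), decay rates \(\beta_1\), \(\beta_2\), and preconditioner coefficient \(\rho\) (the paper's \(\gamma\)), and with each coordinate clipped to magnitude at most 1, the update is:
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
h_t &= \begin{cases} \beta_2 h_{t-k} + (1 - \beta_2)\, \hat{h}_t & \text{on a refresh step} \\ h_{t-1} & \text{otherwise} \end{cases} \\
\theta_{t+1} &= \theta_t - \eta \lambda \theta_t - \eta \, \mathrm{clip}\!\left( \frac{m_t}{\rho B h_t},\, 1 \right)
\end{aligned}
\]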
where \(\lambda\) is the decoupled weight_decay and \(B\) is the
bs (batch size) passed to step. The Hessian estimate
\(h_t\) is refreshed every \(k\) steps by update_hessian.
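The update described here can be sketched on plain NumPy arrays. This is a minimal illustration, not the library's implementation: the function name `sophia_step` is hypothetical, and the small `eps` added to the denominator is an assumption made so that a zero Hessian estimate saturates the clip instead of dividing by zero.

```python
import numpy as np

def sophia_step(theta, m, h, grad, *, lr, beta1, rho, weight_decay, bs, eps=1e-15):
    """One Sophia-style update (illustrative sketch; eps is an assumed guard)."""
    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Decoupled weight decay shrinks the parameters directly.
    theta = theta * (1.0 - lr * weight_decay)
    # Preconditioned magnitude |m| / (rho * B * h), clipped to at most 1 per coordinate.
    ratio = np.minimum(np.abs(m) / (rho * bs * h + eps), 1.0)
    # Signed, clipped step scaled by the learning rate.
    theta = theta - lr * np.sign(m) * ratio
    return theta, m
```

With `h` set to zeros the ratio saturates at 1 wherever `m` is nonzero, so the step is `-lr * sign(m)`; with a large `h` the step shrinks well below `lr`.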
The Gauss-Newton-Bartlett estimator forms \(\hat{h}_t\) from the
per-coordinate squared gradient of a loss evaluated on labels sampled from
the model's own predictive distribution; the batch-size factor \(B\) is
applied here in the denominator rather than folded into
\(\hat{h}_t\), following the official implementation. The clip operates
per coordinate, so the clipped term moves each coordinate by at most
\(\eta\) per step; the weight-decay term is applied separately.
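To make the estimator concrete, the following sketch computes a Gauss-Newton-Bartlett style estimate for a hypothetical linear softmax classifier in NumPy. The model, the function names `gnb_diag_estimate` and `refresh_hessian`, and the array shapes are assumptions chosen for illustration; only the recipe (sample labels from the model, take the gradient of the loss on those labels, square it per coordinate, then average with decay \(\beta_2\)) comes from the description above.

```python
import numpy as np

def gnb_diag_estimate(W, X, rng):
    """Per-coordinate squared gradient on labels sampled from the model itself.

    W: (d, c) weights of a hypothetical linear softmax classifier.
    X: (B, d) mini-batch of inputs.
    """
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Sample one label per example from the model's predictive distribution.
    sampled = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
    onehot = np.eye(probs.shape[1])[sampled]
    # Gradient of the mean cross-entropy on the sampled labels w.r.t. W.
    grad = X.T @ (probs - onehot) / X.shape[0]
    # Square per coordinate; the batch-size factor B is left to the update step.
    return grad * grad

def refresh_hessian(h, h_hat, beta2):
    """Exponential moving average of the diagonal estimate."""
    return beta2 * h + (1.0 - beta2) * h_hat
```

The result has the same shape as `W` and is non-negative everywhere, which is what lets it serve as a diagonal preconditioner.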
Note: Sophia requires a periodic Hessian refresh. Call
update_hessian every k steps after a backward pass on a sampled
loss (a closure), then call step. The bs argument to
step is the batch size used to scale the estimator. Until the first
update_hessian call, the estimate is zero, so every coordinate with a
nonzero first moment saturates the clip and the step reduces to
\(-\eta\,\mathrm{sign}(m_t)\).
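The call order in the note can be sketched as a framework-agnostic loop. Everything here other than the names `update_hessian`, `step`, and `bs` is an assumption: `run_schedule`, `backward_loss`, and `backward_sampled_loss` are hypothetical helpers, `update_hessian` is assumed to take no arguments, and `zero_grad` is assumed to exist on the optimizer. The sketch refreshes on the very first iteration so that training does not start from a zero estimate.

```python
def run_schedule(optimizer, batches, backward_loss, backward_sampled_loss, *, k, bs):
    """Drive an optimizer that exposes update_hessian() and step(bs=...).

    backward_loss(batch): hypothetical callable that backpropagates the true-label loss.
    backward_sampled_loss(batch): hypothetical callable that backpropagates the loss
        on labels sampled from the model's predictions.
    """
    for it, batch in enumerate(batches):
        if it % k == 0:
            # Refresh the Hessian estimate from the sampled-label gradients.
            backward_sampled_loss(batch)
            optimizer.update_hessian()
            optimizer.zero_grad()  # assumed: clear sampled-loss gradients
        # Regular parameter update on the true-label loss.
        backward_loss(batch)
        optimizer.step(bs=bs)
        optimizer.zero_grad()
```

Clearing gradients after the refresh keeps the sampled-label gradients from leaking into the first moment of the following update.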
Reference: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma, "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training", ICLR 2024. https://arxiv.org/abs/2305.14342