SOAP
Implements SOAP (ShampoO with Adam in the Preconditioner's eigenbasis).
SOAP keeps a Shampoo preconditioner for each tensor dimension as a running
average of one-sided gradient outer products. Let \(L_t\) and
\(R_t\) be these preconditioners for a matrix parameter, with
eigenbases \(Q_L\) and \(Q_R\) refreshed every
precondition_frequency steps. The gradient is rotated into that
eigenbasis, Adam runs on the rotated coordinates, and the resulting update
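is rotated back. Writing \(G_t\) for the gradient and \(\tilde{G}_t\) for
its rotation, with squares, square roots, and divisions taken elementwise:
\[
\begin{aligned}
\tilde{G}_t &= Q_L^\top G_t \, Q_R \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\,\tilde{G}_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\,\tilde{G}_t^{\,2} \\
\theta_t &= \theta_{t-1}
  - \gamma\,\frac{\sqrt{\mathrm{bc}_2}}{\mathrm{bc}_1}\,
    Q_L \,\frac{m_t}{\sqrt{v_t} + \epsilon}\, Q_R^\top
  - \gamma\,\lambda\,\theta_{t-1}
\end{aligned}
\]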
where \(m_t\) and \(v_t\) are the Adam moments of the rotated
gradient, and \(\mathrm{bc}_1 = 1 - \beta_1^t\) and
\(\mathrm{bc}_2 = 1 - \beta_2^t\) are the bias-correction terms folded
into the scalar step size. Following the HuggingFace AdamW convention,
\(\epsilon\) is added to the square root of the un-bias-corrected second
moment, \(\sqrt{v_t}\) (rather than to \(\sqrt{\hat{v}_t}\), where
\(\hat{v}_t = v_t / \mathrm{bc}_2\)), which yields
an effective denominator of
\(\sqrt{\hat{v}_t} + \epsilon / \sqrt{\mathrm{bc}_2}\). The final
term \(-\gamma\,\lambda\,\theta_{t-1}\) is decoupled weight decay
(\(\lambda\) = weight_decay), applied after the gradient step and
scaled by the raw learning rate \(\gamma\).
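The update described above can be sketched in NumPy for a single matrix parameter. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `init_state` and `soap_step`, the `shampoo_beta` decay for the preconditioner averages, and all default hyperparameter values are hypothetical. The sketch also skips re-projecting the Adam moments when the eigenbasis is refreshed, which a full implementation would need to handle.

```python
import numpy as np


def init_state(shape):
    """Create empty optimizer state for a 2-D parameter (hypothetical helper)."""
    rows, cols = shape
    return {
        "t": 0,
        "L": np.zeros((rows, rows)),   # running average of G @ G.T
        "R": np.zeros((cols, cols)),   # running average of G.T @ G
        "QL": np.eye(rows),            # eigenbasis of L
        "QR": np.eye(cols),            # eigenbasis of R
        "m": np.zeros(shape),          # Adam first moment (rotated coordinates)
        "v": np.zeros(shape),          # Adam second moment (rotated coordinates)
    }


def soap_step(theta, grad, state, lr=3e-3, beta1=0.95, beta2=0.95,
              shampoo_beta=0.95, eps=1e-8, weight_decay=0.01,
              precondition_frequency=10):
    """One SOAP-style step; mutates `state` and returns the new parameter."""
    state["t"] += 1
    t = state["t"]

    # Running averages of the one-sided gradient outer products.
    state["L"] = shampoo_beta * state["L"] + (1 - shampoo_beta) * grad @ grad.T
    state["R"] = shampoo_beta * state["R"] + (1 - shampoo_beta) * grad.T @ grad

    # Refresh the eigenbases on the first step and every
    # `precondition_frequency` steps after that.
    # Simplification: the moments m and v are NOT re-projected here.
    if t == 1 or t % precondition_frequency == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]

    # Rotate the gradient into the eigenbasis and run Adam there.
    g_rot = QL.T @ grad @ QR
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_rot
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_rot ** 2

    # Bias corrections folded into the scalar step size; eps is added to
    # sqrt(v), not to sqrt(v / bc2).
    bc1 = 1 - beta1 ** t
    bc2 = 1 - beta2 ** t
    step_size = lr * np.sqrt(bc2) / bc1
    update = QL @ (state["m"] / (np.sqrt(state["v"]) + eps)) @ QR.T

    # Gradient step, then decoupled weight decay scaled by the raw lr.
    return theta - step_size * update - lr * weight_decay * theta


# Usage on a toy quadratic loss ||theta||_F^2, whose gradient is 2 * theta.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))
state = init_state(theta.shape)
for _ in range(5):
    theta = soap_step(theta, 2 * theta, state)
```

Dividing the numerator and denominator of `step_size * update` by \(\mathrm{bc}_1\) and \(\sqrt{\mathrm{bc}_2}\) respectively gives \(\gamma\,\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon / \sqrt{\mathrm{bc}_2})\) in rotated coordinates, which is the effective denominator stated above.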
Reference: Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade, "SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling", ICLR 2025. https://arxiv.org/abs/2409.11321