4-bit Shampoo
Implements 4-bit Shampoo, Shampoo with its preconditioners and their inverse roots stored at 4-bit precision.
The base method is Shampoo: for each layer it maintains left and right second-order statistics \(L_t, R_t\) and preconditions the gradient with their inverse fourth roots. To save memory, 4-bit Shampoo avoids quantizing \(L_t\) and \(R_t\) directly. Instead, it eigendecomposes each symmetric matrix \(A=U\Lambda U^{\top}\) and quantizes only the eigenvector matrix \(U\) to 4 bits, keeping the eigenvalues \(\Lambda\) in full precision. The quantized factor stays close to orthogonal, a structure the preconditioner matrix itself does not have. On dequantization, the eigenvectors are re-orthogonalized with a few Björck iterations, and the inverse fourth root is rebuilt from the rectified eigenvectors and the eigenvalues. The preconditioned gradient is then rescaled to match the norm of the original gradient and passed to a first-order optimizer (SGDM or AdamW).
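One way to write a single update step consistent with this description is sketched below. The \((1-\beta)\) weighting of the statistics, the placement of \(\epsilon\) on the eigenvalues, the coefficients of the Björck step, and the use of the Frobenius norm are assumptions made for illustration, not details stated in the text:

\[
\begin{aligned}
L_t &= \beta L_{t-1} + (1-\beta)\, G_t G_t^{\top}, \qquad R_t = \beta R_{t-1} + (1-\beta)\, G_t^{\top} G_t \\
A &= U \Lambda U^{\top}, \qquad V^{(0)} = D(Q(U)), \qquad V^{(j+1)} = \tfrac{3}{2} V^{(j)} - \tfrac{1}{2} V^{(j)} V^{(j)\top} V^{(j)} \\
\hat A &= V^{(k)} (\Lambda + \epsilon I)^{-1/4} V^{(k)\top} \\
\tilde G_t &= \frac{\lVert G_t \rVert_F}{\lVert \hat L_t G_t \hat R_t \rVert_F}\, \hat L_t G_t \hat R_t \\
(\theta_{t+1}, s_{t+1}) &= F(\theta_t, s_t, \tilde G_t;\ \eta, \lambda)
\end{aligned}
\]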
Throughout, \(\theta\) denotes the parameters, \(G_t\) the gradient (reshaped to a matrix), \(\beta\) the decay rate of the statistics, and \(\epsilon\) a damping constant. \(Q\) is the 4-bit quantizer and \(D\) its dequantizer, and \(V^{(k)}\) is the result of \(k\) Björck orthonormalization iterations applied to the dequantized eigenvectors. \(\hat A\) is the inverse fourth root rebuilt from \(V^{(k)}\) and \(\Lambda\); \(\hat L_t,\hat R_t\) are the quantized inverse-fourth-root preconditioners obtained by applying this construction to \(L_t\) and \(R_t\). \(\tilde G_t\) is the norm-rescaled preconditioned gradient, \(s_t\) the inner first-order state, and \(F\) the wrapped first-order optimizer (SGDM or AdamW), which carries the learning rate \(\eta\) and weight decay \(\lambda\).
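The pipeline described above (eigendecompose, quantize the eigenvectors, dequantize, re-orthogonalize, rebuild the inverse fourth root, rescale) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: all function names are hypothetical, the quantizer is a simple block-wise linear one standing in for \(Q\) and \(D\), the block size of 64 is an arbitrary choice, codes are held in `int8` rather than packed two per byte, and \(\epsilon\) is assumed to be added to the eigenvalues.

```python
import numpy as np

def quantize_4bit(x, block=64):
    """Hypothetical stand-in for Q: block-wise linear quantization to codes in [-7, 7]."""
    flat = x.ravel()
    pad = (-flat.size) % block                       # pad so the length is a multiple of `block`
    blocks = np.pad(flat, (0, pad)).reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0                          # avoid division by zero for all-zero blocks
    codes = np.round(blocks / scale * 7).astype(np.int8)  # 15 levels; a real kernel would pack to 4 bits
    return codes, scale, x.shape

def dequantize_4bit(codes, scale, shape):
    """Hypothetical stand-in for D: map codes back to floats and drop the padding."""
    flat = (codes.astype(np.float64) / 7 * scale).ravel()
    return flat[: int(np.prod(shape))].reshape(shape)

def bjorck(V, k=2):
    """k Björck iterations, pulling a nearly orthogonal V back toward orthogonality."""
    for _ in range(k):
        V = 1.5 * V - 0.5 * V @ (V.T @ V)
    return V

def inverse_fourth_root_4bit(A, eps=1e-6, k=2):
    """Rebuild A^(-1/4) from 4-bit eigenvectors and full-precision eigenvalues."""
    lam, U = np.linalg.eigh(A)                       # A = U diag(lam) U^T for symmetric A
    V = bjorck(dequantize_4bit(*quantize_4bit(U)), k)
    lam = np.maximum(lam, 0.0) + eps                 # assumed damping: clamp, then shift by eps
    return (V * lam ** -0.25) @ V.T                  # V diag(lam^(-1/4)) V^T

def precondition(G, L, R, eps=1e-6, k=2):
    """Apply both preconditioners, then rescale to the norm of the original gradient."""
    P = inverse_fourth_root_4bit(L, eps, k) @ G @ inverse_fourth_root_4bit(R, eps, k)
    return P * (np.linalg.norm(G) / (np.linalg.norm(P) + 1e-30))
```

For a gradient `G` with statistics `L = G @ G.T` and `R = G.T @ G`, calling `precondition(G, L, R)` returns a matrix with the same shape and Frobenius norm as `G`, which would then be handed to the first-order optimizer in place of the raw gradient.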
Reference: Sike Wang, Pan Zhou, Jia Li, Hua Huang, "4-bit Shampoo for Memory-Efficient Network Training", NeurIPS 2024. https://arxiv.org/abs/2405.18144