Distributed Shampoo¶
Implements Distributed Shampoo, a full-matrix Kronecker-factored preconditioner with learning-rate grafting for at-scale data-parallel training.
For a matrix parameter \(\theta\in\mathbb{R}^{m\times n}\), Shampoo keeps left and right second-order statistics \(L_t\in\mathbb{R}^{m\times m}\) and \(R_t\in\mathbb{R}^{n\times n}\) as exponential moving averages of \(g_t g_t^{\top}\) and \(g_t^{\top} g_t\), and preconditions the gradient by their inverse fourth roots. Because the raw Shampoo direction has no natural scale, the step length is grafted from a diagonal AdaGrad/Adam preconditioner: the Shampoo direction is rescaled to the Frobenius norm of the grafting direction. Nesterov-style momentum and decoupled (AdamW) weight decay are then applied to produce the update.
where \(\theta\) are the parameters, \(g_t\) the gradient reshaped to a matrix, \(\eta\) the learning rate, \(\beta_2\) the second-moment decay, \(\mu\) the momentum coefficient, \(\lambda\) the decoupled weight-decay coefficient, \(\epsilon\) the regularization/damping constant, \(L_t,R_t\) the left and right Kronecker factors with \(L_t^{-1/4},R_t^{-1/4}\) their inverse fourth roots (via eigendecomposition), \(A_t\) the diagonal grafting state, \(\odot\) the elementwise product, \(\lVert\cdot\rVert_F\) the Frobenius norm, and \(M_t\) the momentum buffer.
Reference: Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat, "A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale", ACM Transactions on Mathematical Software 2023. https://arxiv.org/abs/2309.06497