SinkGD
Implements SinkGD, stateless gradient descent that Sinkhorn-normalizes each weight-matrix gradient before the step.
SinkGD treats each per-layer gradient as an \(m\times n\) matrix and applies a few iterations of alternating row and column rescaling (a Sinkhorn-like procedure, SR-Sinkhorn). Each pass first divides every row by its \(\ell_2\) norm and multiplies by \(\sqrt{n}\), then divides every column by its \(\ell_2\) norm and multiplies by \(\sqrt{m}\), so that in the limit every row has norm \(\sqrt{n}\) and every column has norm \(\sqrt{m}\). \(L\) such passes give the normalized gradient \(\hat g_t\), which is then used for a plain SGD step. The method keeps no optimizer state, so its memory footprint matches that of momentum-free SGD; the reference below reports performance comparable to Adam on LLM training.
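Written out from the description above, one step can be sketched as follows. The intermediate symbols \(X^{(k)}\) and \(\tilde X^{(k)}\) and the time-indexing convention are notation assumed here for illustration, not taken from the reference:

\[
\begin{aligned}
X^{(0)} &= g_t, \\
\tilde X^{(k)} &= \sqrt{n}\, Q\big(X^{(k-1)}\big)^{-1} X^{(k-1)}, \\
X^{(k)} &= \sqrt{m}\, \tilde X^{(k)}\, R\big(\tilde X^{(k)}\big)^{-1}, \qquad k = 1,\dots,L, \\
\hat g_t &= X^{(L)}, \\
\theta_{t+1} &= \theta_t - \eta_t\, \hat g_t,
\end{aligned}
\]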
where \(\theta\) are the parameters, \(g_t\) is the gradient reshaped to an \(m\times n\) matrix, \(\eta_t\) is the learning rate, \(L\) is the number of SR-Sinkhorn iterations, \(Q(X) = \mathrm{diag}(\lVert X_{1,:}\rVert_2,\dots,\lVert X_{m,:}\rVert_2)\) is the \(m\times m\) diagonal matrix of row \(\ell_2\) norms, and \(R(X) = \mathrm{diag}(\lVert X_{:,1}\rVert_2,\dots,\lVert X_{:,n}\rVert_2)\) is the \(n\times n\) diagonal matrix of column \(\ell_2\) norms. SinkGD maintains no moment estimates and has no weight-decay or \(\epsilon\) term.
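The alternating normalization is short enough to sketch directly in NumPy. The block below is a minimal illustration under stated assumptions, not this library's implementation: the function names `sr_sinkhorn` and `sinkgd_step`, their signatures, and the default of 5 iterations are all hypothetical. Following the description, no \(\epsilon\) is added, so an all-zero row or column would yield NaNs.

```python
import numpy as np


def sr_sinkhorn(grad, num_iters=5):
    """Alternately rescale the rows and columns of a 2-D gradient.

    Hypothetical helper: the name, signature, and default num_iters are
    illustrative assumptions, not taken from any library.
    """
    x = np.asarray(grad, dtype=np.float64)
    m, n = x.shape
    for _ in range(num_iters):
        # Row step: X <- sqrt(n) * Q(X)^{-1} X, giving each row l2 norm sqrt(n).
        row_norms = np.linalg.norm(x, axis=1, keepdims=True)  # shape (m, 1)
        x = np.sqrt(n) * x / row_norms
        # Column step: X <- sqrt(m) * X R(X)^{-1}, giving each column l2 norm sqrt(m).
        col_norms = np.linalg.norm(x, axis=0, keepdims=True)  # shape (1, n)
        x = np.sqrt(m) * x / col_norms
    return x


def sinkgd_step(theta, grad, lr, num_iters=5):
    """One stateless update: theta - lr * SR-Sinkhorn(grad)."""
    return theta - lr * sr_sinkhorn(grad, num_iters)


# Toy example on a random 4 x 6 "layer" (m = 4, n = 6).
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 6))
grad = rng.normal(size=(4, 6))

g_hat = sr_sinkhorn(grad, num_iters=5)
new_theta = sinkgd_step(theta, grad, lr=1e-2)
```

Because each pass ends with the column step, every column of `g_hat` has norm \(\sqrt{m}\) up to rounding, while the row norms only approach \(\sqrt{n}\) as `num_iters` grows. The first row step also removes any positive overall scale, so `sr_sinkhorn(c * grad)` matches `sr_sinkhorn(grad)` for \(c > 0\).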
Reference: Meyer Scetbon, Chao Ma, Wenbo Gong, Edward Meeds, "Gradient Multi-Normalization for Stateless and Scalable LLM Training", arXiv 2025. https://arxiv.org/abs/2502.06742