ASGO¶
Implements ASGO (Adaptive Structured Gradient Optimization), a one-sided full-matrix adaptive method for matrix-shaped parameters.
ASGO treats the gradient of a weight matrix as a structured object rather than a flat vector. Instead of a diagonal preconditioner over individual entries (as in Adam), it accumulates the full outer product \(G_t G_t^\top\) to capture row-space curvature, then preconditions the gradient on the left by the inverse square root of this matrix. This is the single-sided analogue of Shampoo: one preconditioner of the parameter's row dimension rather than two, which keeps the curvature estimate full-matrix along one axis while staying cheaper than a two-sided method.
where \(W_t \in \mathbb{R}^{m \times n}\) is the parameter matrix, \(G_t\) its gradient, \(V_t \in \mathbb{R}^{m \times m}\) the accumulated left preconditioner from the gradient outer products, \(V_t^{1/2}\) the matrix square root, \(\Lambda_t^{-1}\) the inverse preconditioner, \(\eta_t\) the learning rate, \(\epsilon\) a stability constant, and \(I\) the identity. The practical variant replaces the plain accumulations with exponential moving averages (momentum \(M_t = \beta_1 M_{t-1} + (1-\beta_1) G_t\) and \(V_t = \beta_2 V_{t-1} + (1-\beta_2) G_t^\top G_t\)), recomputes the inverse square root every \(\tau\) steps for efficiency, and applies the preconditioner on the right.
Reference: Kang An, Yuxing Liu, Rui Pan, Shiqian Ma, Donald Goldfarb, Tong Zhang, "ASGO: Adaptive Structured Gradient Optimization", arXiv 2025. https://arxiv.org/abs/2503.20762