AdaDiag
Implements AdaDiag, an adaptive method that diagonalizes the preconditioner by rotating gradients into the singular-vector basis.
Adam's diagonal second moment implicitly assumes that the gradient covariance is diagonal, an assumption that fails for the structured (matrix-shaped) gradients of neural-network layers. AdaDiag instead periodically computes a singular value decomposition of the gradient matrix, \(G_t = P_t \Sigma_t Q_t^{\top}\), and projects the gradient into the rotated basis, \(\tilde G_t = P_t^{\top} G_t\), before forming the moments. In that basis the gradient covariance is closer to diagonal, so the per-coordinate Adam statistics \(M_t, V_t\) approximate a full preconditioner more closely than they do in the original coordinates; the resulting update is then rotated back into the original coordinates. A one-sided variant rotates only by \(P_t\), while a two-sided variant rotates by both \(P_t\) and \(Q_t\) (that is, \(\tilde G_t = P_t^{\top} G_t Q_t\)). The SVD is recomputed only every \(T\) steps, and the rotation matrices are reused in between.
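The one-sided update described above can be written out roughly as follows. This is a sketch reconstructed from the prose, not a verbatim copy of the reference algorithm: the hatted quantities \(\hat M_t, \hat V_t\) (bias-corrected moments), the elementwise square \(\tilde G_t^{\odot 2}\), and the exact placement of \(\epsilon\) and of the weight-decay term are assumptions.

\[
\begin{aligned}
\tilde G_t &= P_t^{\top} G_t, \\
M_t &= \beta_1 M_{t-1} + (1-\beta_1)\,\tilde G_t, \qquad \hat M_t = \frac{M_t}{1-\beta_1^{t}}, \\
V_t &= \beta_2 V_{t-1} + (1-\beta_2)\,\tilde G_t^{\odot 2}, \qquad \hat V_t = \frac{V_t}{1-\beta_2^{t}}, \\
W_{t+1} &= W_t - \eta_t \left( P_t\,\frac{\hat M_t}{\sqrt{\hat V_t}+\epsilon} + \lambda W_t \right),
\end{aligned}
\]

with the division and square root taken elementwise, and \(P_t\) refreshed from a new SVD only when \(t\) reaches a multiple of the period \(T\).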
Throughout, \(W\) (also written \(\theta\)) is the layer's weight matrix, \(G_t\) its gradient, \(P_t, Q_t\) the left and right singular-vector rotation matrices, \(\tilde G_t\) the projected gradient, \(M_t, V_t\) the bias-corrected first- and second-moment estimates of the projected gradient (with decay rates \(\beta_1, \beta_2\)), \(\eta_t\) the learning rate, \(\lambda\) the decoupled weight-decay coefficient, \(\epsilon\) the stability constant, and \(T\) the SVD recomputation period.
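To make the notation concrete, here is a minimal NumPy sketch of a single one-sided step. It is illustrative only and is not this library's API: the names `adadiag_step`, `init_state`, and `svd_period` are hypothetical, the default hyperparameters are placeholders, and carrying the moments over unchanged when the rotation is refreshed is an assumption rather than something the description above specifies.

```python
import numpy as np


def init_state(shape):
    """Hypothetical per-layer state: step counter, rotation P, and moments M, V."""
    return {"step": 0, "P": None, "M": np.zeros(shape), "V": np.zeros(shape)}


def adadiag_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.0, svd_period=10):
    """One one-sided AdaDiag-style step on a 2-D weight matrix (sketch)."""
    state["step"] += 1
    t = state["step"]

    # Refresh the left rotation on the first step, then every svd_period steps.
    # Assumption: the moments M, V are kept as they are across a refresh.
    if state["P"] is None or (t - 1) % svd_period == 0:
        P, _, _ = np.linalg.svd(G, full_matrices=True)
        state["P"] = P
    P = state["P"]

    # Project the gradient into the rotated basis.
    G_rot = P.T @ G

    # Adam moments, accumulated in the rotated basis.
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_rot
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot**2

    # Bias correction.
    M_hat = state["M"] / (1 - beta1**t)
    V_hat = state["V"] / (1 - beta2**t)

    # Decoupled weight decay, then the update rotated back by P.
    W = W * (1 - lr * weight_decay)
    return W - lr * (P @ (M_hat / (np.sqrt(V_hat) + eps)))


# Usage on a toy objective ||W||_F^2, whose gradient is 2 W.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
state = init_state(W.shape)
for _ in range(12):
    W = adadiag_step(W, 2 * W, state, lr=0.1, svd_period=10)
```

A two-sided variant would additionally keep the right rotation \(Q_t\), form `P.T @ G @ Q`, and rotate the update back with `P @ (...) @ Q.T`; the sketch omits it to stay short.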
Reference: Son Nguyen The, Bo Liu, Lizhang Chen, Qiang Liu, "Improving Adaptive Moment Optimization via Preconditioner Diagonalization", 2025. https://arxiv.org/abs/2502.07488