SRON¶
Implements SRON (SGD with Row-wise Normalization), a state-free optimizer that rescales each row of a 2-D gradient by its own root-mean-square before a plain SGD step.
The method is motivated by the observation that row-wise gradient scales in LLM training, especially in the attention module, vary widely: flattening the gradient into a single vector for global normalization lets one large dimension suppress updates in the others. Instead of storing any moments, SRON forms a diagonal matrix \(V_t\) from the per-row second moment of the current gradient and left-multiplies the gradient by it, so each row is normalized to a comparable magnitude. With no first- or second-moment buffers, it eliminates optimizer state memory entirely.
where \(W\) is the weight matrix with rows indexed by \(i\) and columns by \(j\), \(G_t\) is the batch gradient, \(V_t=\mathrm{diag}(\cdot)\) is the diagonal row-wise normalizer (one entry per row), \(\eta_t\) is the learning rate, \(\alpha\) is a scaling coefficient, and \(\epsilon>0\) is a numerical-stability constant.
Reference: Ziqing Wen, Yanqi Shi, Jiahuan Wang, Ping Luo, Linbo Qiao, Dongsheng Li, Tao Sun, "SRON: State-free LLM Training via Row-wise Gradient Normalization", ICLR 2026 submission. https://openreview.net/forum?id=BtQLBWr6zI