Nora
Implements Nora, a scalable matrix optimizer that projects momentum onto the orthogonal complement of each weight row, then row-normalizes it.
Nora targets matrix-shaped parameters (e.g. Transformer weights). It keeps a single momentum buffer and, for each row of the weight matrix, removes the component of the momentum that is parallel to the corresponding weight row. This row-wise orthogonal projection stabilizes weight norms and angular velocities. Each projected row is then divided by its own \(L_2\) norm, which yields a scale-invariant update. The result approximates structured preconditioning by exploiting the row block-diagonal dominance of the Transformer Hessian, all at \(O(mn)\) cost for an \(m \times n\) weight matrix.
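The two row-wise operations described above can be written out explicitly. This is only the standard formula for projecting onto the orthogonal complement of a vector, followed by \(L_2\) normalization, expressed with the symbols used in this section. It does not cover the momentum update or the final parameter step.

\[
\begin{aligned}
m_{t,i:}^{\perp} &= m_{t,i:} - \frac{\langle m_{t,i:},\, \theta_{t,i:} \rangle}{\lVert \theta_{t,i:} \rVert_2^2}\, \theta_{t,i:}, \\
d_{t,i:} &= \frac{m_{t,i:}^{\perp}}{\lVert m_{t,i:}^{\perp} \rVert_2}.
\end{aligned}
\]

By construction \(\langle m_{t,i:}^{\perp}, \theta_{t,i:} \rangle = 0\) and \(\lVert d_{t,i:} \rVert_2 = 1\) for every row \(i\). Both steps need only row-wise dot products and norms, which is consistent with the stated \(O(mn)\) cost.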
In the update rule, \(\theta_t\) is the weight matrix with \(i\)-th row \(\theta_{t,i:}\), \(g_t\) is the gradient, \(m_t\) is the momentum buffer with decay rate \(\beta\), \(m_{t,i:}^{\perp}\) is the \(i\)-th momentum row projected onto the orthogonal complement of \(\theta_{t,i:}\), \(d_t\) is the row-normalized update direction, \(\eta\) is the learning rate, and \(\lambda\) is the weight decay coefficient.
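To make the symbols concrete, here is a minimal NumPy sketch of a single step. It is an illustration under stated assumptions, not the library's implementation. The function names `nora_direction` and `nora_step` are hypothetical, as is the small `eps` stabilizer. The exponential-moving-average form of the momentum and the decoupled form of the weight decay are assumptions that this section does not confirm. Only the projection and row normalization follow directly from the description.

```python
import numpy as np


def nora_direction(theta, momentum, eps=1e-8):
    """Project each momentum row off its weight row, then row-normalize.

    Hypothetical helper; `eps` is an assumed stabilizer against division by zero.
    """
    # Row-wise <m_i, theta_i> and ||theta_i||^2, kept as (rows, 1) columns.
    row_dot = np.sum(momentum * theta, axis=1, keepdims=True)
    row_sq = np.sum(theta * theta, axis=1, keepdims=True)
    # Remove the component of each momentum row parallel to its weight row.
    m_perp = momentum - (row_dot / (row_sq + eps)) * theta
    # Divide each projected row by its own L2 norm.
    row_norm = np.linalg.norm(m_perp, axis=1, keepdims=True)
    return m_perp / (row_norm + eps)


def nora_step(theta, grad, momentum, lr=1e-2, beta=0.9, weight_decay=0.0):
    """One illustrative step; returns (new_theta, new_momentum)."""
    # ASSUMPTION: momentum is an exponential moving average of gradients.
    momentum = beta * momentum + (1.0 - beta) * grad
    d = nora_direction(theta, momentum)
    # ASSUMPTION: weight decay is decoupled and scaled by the learning rate.
    new_theta = theta - lr * (d + weight_decay * theta)
    return new_theta, momentum
```

Every operation is either elementwise or a per-row reduction, so the work per step is proportional to the number of entries in the matrix. A row whose momentum is exactly parallel to its weight row projects to zero, and with `eps` in the denominator its update is simply zero rather than undefined.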
Reference: Jinghui Yuan, Jiaxuan Zou, Shuo Wang, Yong Liu, Feiping Nie, "Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer", arXiv 2026. https://arxiv.org/abs/2605.03769