Conda
Implements Conda (Column-Normalized Adam), a subspace optimizer that combines Muon-style orthogonal projection with Adam-style coordinate-wise adaptivity.
The first moment \(m_t\) is accumulated as in Adam. Every \(T\) steps, an SVD of \(m_t\) supplies a left-singular basis \(U_t\) that defines a low-dimensional column subspace; between refreshes the most recent basis is reused. Both the momentum and the raw gradient are projected into this subspace, and a second moment \(v_t\) is maintained on the projected gradient. Normalizing the projected momentum column-wise by \(\sqrt{v_t}\) recovers Adam's per-coordinate adaptivity inside the conditioned subspace, and the result is then mapped back to parameter space with \(U_t\).
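Written out, the steps described above take roughly the following form. This is a reconstruction from the prose and the symbol definitions on this page, not a verbatim copy of the reference formulation: the tilde notation for projected quantities is introduced here for illustration, bias correction is omitted, and placing \(\epsilon\) outside the square root is an assumption.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
U_t &= \text{left singular vectors of } m_t \ \text{ if } t \bmod T = 0, \quad \text{otherwise } U_t = U_{t-1} \\
\tilde{m}_t &= U_t^\top m_t, \qquad \tilde{g}_t = U_t^\top g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, \tilde{g}_t^{\,2} \\
\theta_t &= \theta_{t-1} - \eta\, U_t \, \frac{\tilde{m}_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
\]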
Here, \(\theta\) are the (matrix-shaped) parameters, \(g_t\) is the gradient, \(\eta\) is the learning rate, \(m_t\) and \(v_t\) are the first and second moments, \(\beta_1, \beta_2\) are their decay rates, \(\epsilon\) is the stability constant, and \(U_t\) holds the left singular vectors of \(m_t\), refreshed every \(T\) steps. The operations \((\cdot)^2\), division, and \(\sqrt{\cdot}\) are element-wise.
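The update described above can be sketched in plain NumPy for a single matrix parameter. This is an illustrative sketch, not this library's implementation: the helper names `conda_init` and `conda_step`, the state layout, and the default hyperparameters are hypothetical. It also assumes a thin SVD, no bias correction, and \(\epsilon\) added outside the square root.

```python
import numpy as np


def conda_init(theta):
    """Build an initial state (m, v, U, t) for one matrix parameter (hypothetical layout)."""
    r = min(theta.shape)  # number of left singular vectors kept by the thin SVD
    m = np.zeros_like(theta)             # first moment, same shape as theta
    v = np.zeros((r, theta.shape[1]))    # second moment, lives in the projected space
    return (m, v, None, 0)


def conda_step(theta, g, state, lr=1e-3, b1=0.9, b2=0.99, eps=1e-8, T=10):
    """One Conda-style update; assumes no bias correction and eps outside the sqrt."""
    m, v, U, t = state

    # Adam-style first moment on the raw gradient.
    m = b1 * m + (1.0 - b1) * g

    # Refresh the left-singular basis every T steps; otherwise reuse the old one.
    if U is None or t % T == 0:
        U, _, _ = np.linalg.svd(m, full_matrices=False)

    # Project both the momentum and the raw gradient into the subspace.
    m_proj = U.T @ m
    g_proj = U.T @ g

    # Second moment is tracked on the projected gradient (element-wise square).
    v = b2 * v + (1.0 - b2) * g_proj**2

    # Element-wise normalization in the subspace, then map back with U.
    update = U @ (m_proj / (np.sqrt(v) + eps))
    theta = theta - lr * update
    return theta, (m, v, U, t + 1)
```

A caller would create the state once with `conda_init(theta)` and then thread it through repeated `conda_step` calls. Note that in this sketch \(v_t\) is carried across basis refreshes unchanged, which is one possible design choice and is not confirmed by the text above.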
Reference: Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin, "Conda: Column-Normalized Adam for Training Large Language Models Faster", arXiv 2025. https://arxiv.org/abs/2509.24218