SCSAdamW
Implements SCSAdamW, an AdamW variant that replaces the first moment with a stochastic conjugate subgradient direction.
Instead of an exponentially weighted first moment, SCSAdamW builds the search direction by blending the previous direction \(d_{t-1}\) with the current gradient \(g_t\). The blend weight \(\lambda_t^\ast\) comes from a one-dimensional line search that minimizes the norm of the combined direction along the segment between \(d_{t-1}\) and \(g_t\); the minimizer is then projected (clamped) onto \([0,1]\). The resulting conjugate direction is bias-corrected, divided by the AdamW-style RMS of the gradient, and applied with decoupled weight decay.
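The line search described above has a closed form, which the following minimal sketch illustrates. It assumes the combined direction is parametrized as \(\lambda d_{t-1} + (1-\lambda) g_t\); the reference may place the weight on the other endpoint. The helper names `blend_weight` and `blend` are hypothetical and are not part of any library API.

```python
def blend_weight(d_prev, g, eps=1e-12):
    """Return lambda* in [0, 1] minimizing ||lam * d_prev + (1 - lam) * g||.

    Assumed parametrization (not confirmed by the source):
    d(lam) = g + lam * (d_prev - g).
    """
    u = [dp - gi for dp, gi in zip(d_prev, g)]   # u = d_prev - g
    denom = sum(ui * ui for ui in u)             # ||u||^2
    if denom <= eps:
        # d_prev and g coincide, so every lam gives the same direction.
        return 0.0
    # Unconstrained minimizer of ||g + lam * u||^2 is -<g, u> / ||u||^2.
    lam = -sum(gi * ui for gi, ui in zip(g, u)) / denom
    # Projection onto the unit interval.
    return min(1.0, max(0.0, lam))


def blend(d_prev, g):
    """Return (lambda*, blended direction) for flat lists of floats."""
    lam = blend_weight(d_prev, g)
    d = [lam * dp + (1.0 - lam) * gi for dp, gi in zip(d_prev, g)]
    return lam, d
```

For orthogonal unit vectors such as `d_prev = [1.0, 0.0]` and `g = [0.0, 1.0]`, the minimum-norm point is the midpoint of the segment, so the weight is 0.5. When the unconstrained minimizer falls outside the segment, the clamp selects whichever endpoint has the smaller norm.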
Notation: \(\theta\) are the parameters, \(\eta\) is the learning rate, \(g_t\) is the gradient, \(d_t\) is the conjugate subgradient direction with line-search weight \(\lambda_t^\ast \in [0,1]\), \(v_t\) is the second moment with decay \(\beta_2\), \(\Pi_{[0,1]}\) is the projection onto the unit interval, \(\lambda\) (without subscript) is the decoupled weight decay, and \(\zeta\) is a small stability constant.
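To show how these symbols fit together, here is a hedged sketch of the parameter update once a blended direction \(d_t\) is available. It follows the usual AdamW conventions for the second moment, its bias correction, and decoupled weight decay. The text does not give the bias-correction factor applied to \(d_t\), so the sketch exposes it as a hypothetical argument `d_correction` (default 1.0, meaning no correction). The function name `apply_direction` is also hypothetical.

```python
import math


def apply_direction(theta, d, g, v, step, lr=1e-3, beta2=0.999,
                    weight_decay=1e-2, zeta=1e-8, d_correction=1.0):
    """Apply one update to a flat parameter list, given a blended direction d.

    theta, d, g, v are equal-length lists of floats; step counts from 1.
    d_correction is a placeholder for the direction's bias-correction
    factor, which the source mentions but does not specify.
    """
    # Second moment: exponential moving average of squared gradients.
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Standard Adam bias correction for the second moment.
    v_correction = 1.0 - beta2 ** step

    theta_new = []
    for th, di, vi in zip(theta, d, v_new):
        # Decoupled weight decay: shrink the parameter directly.
        th = th * (1.0 - lr * weight_decay)
        # RMS of the gradient plus the stability constant.
        denom = math.sqrt(vi / v_correction) + zeta
        theta_new.append(th - lr * (di / d_correction) / denom)
    return theta_new, v_new
```

Keeping the weight decay outside the normalized step is what makes it decoupled in the AdamW sense: the shrinkage depends only on \(\eta\) and \(\lambda\), not on \(v_t\).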
Reference: Di Zhang, Yihang Zhang, "Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW", arXiv preprint 2025. https://arxiv.org/abs/2507.01241