ScheduleFree+
Implements ScheduleFree+, a learning-rate-free and schedule-free variant of AdamW that adds a Polyak step size.
ScheduleFree+ keeps the three-sequence structure of Schedule-Free learning: an averaged iterate \(x_t\), a raw optimizer iterate \(z_t\), and an evaluation point \(y_t\) that interpolates between them and at which the gradient is evaluated. On top of this, it removes the need to tune a learning rate by setting the effective step from a Polyak rule: the AdamW step is scaled by \(\max(0, F_t + I_t)\) divided by a bias-corrected exponential average of the \(\ell_1\) gradient norm. The \(\sqrt{\pi/2}\) factor converts the \(\ell_1\) norm to an \(\ell_2\) estimate; for a zero-mean Gaussian entry \(g\) with standard deviation \(\sigma\), \(\mathbb{E}|g| = \sigma\sqrt{2/\pi}\). The interpolation weight \(\tilde\beta_t\) is annealed from \(\beta_{\mathrm{sf}}\) toward \(\beta_{\mathrm{sf}}^{\max}\) over \(T_{\mathrm{anneal}}\) steps, and weight decay follows the decoupled AdamC form (scaled by \(\alpha_t^2\)). The returned parameters are the averaged iterate \(x_t\).
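The three-sequence structure described above can be illustrated with a minimal sketch. This is not the ScheduleFree+ update: it uses a plain SGD base step with a fixed learning rate and a fixed interpolation weight, and it omits the Polyak step, the Adam moments, the annealing of \(\tilde\beta_t\), and weight decay. The function name `schedule_free_sgd`, its arguments, and the convention \(y = (1-\beta) z + \beta x\) with uniform averaging \(c_t = 1/t\) are assumptions that follow the original Schedule-Free formulation.

```python
def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=500):
    """Toy Schedule-Free SGD on a scalar problem (illustrative only).

    grad  : callable returning the gradient at a point (hypothetical helper)
    x0    : initial value shared by the averaged and raw iterates
    lr    : fixed base step size (ScheduleFree+ would set this adaptively)
    beta  : fixed interpolation weight (ScheduleFree+ anneals it)
    """
    x = z = float(x0)
    for t in range(1, steps + 1):
        # Evaluation point: interpolate between raw iterate z and average x.
        y = (1.0 - beta) * z + beta * x
        # The gradient is taken at y, not at x or z.
        g = grad(y)
        # Base optimizer step is applied to the raw iterate.
        z = z - lr * g
        # Uniform averaging weight (assumption: c_t = 1 / t).
        c = 1.0 / t
        # Averaged iterate: this is what would be returned as the parameters.
        x = (1.0 - c) * x + c * z
    return x


# Minimize f(y) = 0.5 * (y - 3)^2, whose gradient is (y - 3).
x_final = schedule_free_sgd(lambda y: y - 3.0, x0=0.0)
```

On this quadratic the averaged iterate approaches the minimizer at 3 without any decaying learning-rate schedule, which is the property the Schedule-Free structure is meant to provide.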
In the update rule, \(\theta\) denotes the parameters (returned as \(x_t\)), \(\gamma_t\) the warmup factor, \(\alpha_t\) the Polyak effective step, \(g_t\) the gradient at \(y_{t-1}\), \(m_t, v_t\) the Adam moments with decays \(\beta_1, \beta_2\) and bias-corrected forms \(\hat{m}_t, \hat{v}_t\), \(\lambda\) the decoupled weight decay, \(\epsilon\) a stability constant, \(\beta_p\) the EMA coefficient for the \(\ell_1\) gradient-norm estimate \(\hat{e}_t\), and \(\tilde\beta_t\) the annealed interpolation weight. The averaging weight is \(c_t = w_t / W_t\) with \(w_t = t^r\, \gamma_{\max}^{\,p}\), except during the first \(C_{\mathrm{warmup}}\) steps, where \(c_t = 1\).
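As a rough illustration of the averaging weight, the sketch below computes \(c_t = w_t / W_t\) for a sequence of steps. It rests on several assumptions that the text above does not state: \(W_t\) is taken to be the running sum \(\sum_{i \le t} w_i\), that sum keeps accumulating during the first \(C_{\mathrm{warmup}}\) steps, and \(\gamma_{\max}\) is treated as a constant. The function name `averaging_weights`, its argument names, and its default values are hypothetical.

```python
def averaging_weights(steps, r=0.0, p=2.0, gamma_max=1.0, c_warmup=0):
    """Return [c_1, ..., c_steps] with c_t = w_t / W_t (illustrative only).

    Assumes W_t is the running sum of w_i = i**r * gamma_max**p, and that
    c_t is forced to 1 for t <= c_warmup while W_t keeps accumulating.
    """
    weights = []
    w_sum = 0.0
    for t in range(1, steps + 1):
        w = (t ** r) * (gamma_max ** p)   # w_t = t^r * gamma_max^p
        w_sum += w                        # W_t (assumed cumulative sum)
        c = 1.0 if t <= c_warmup else w / w_sum
        weights.append(c)
    return weights


uniform = averaging_weights(4)                 # r = 0 gives c_t = 1 / t
linear = averaging_weights(3, r=1.0)           # r = 1 gives c_t = 2 / (t + 1)
warm = averaging_weights(4, c_warmup=2)        # first two steps use c_t = 1
```

With \(r = 0\) and constant \(\gamma_{\max}\), every step gets equal weight and \(x_t\) is a plain running mean of the \(z_t\); larger \(r\) shifts weight toward later iterates. Setting \(c_t = 1\) during warmup makes \(x_t\) track \(z_t\) exactly for those steps.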
Reference: Aaron Defazio, "ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models", arXiv 2026. https://arxiv.org/abs/2605.19095