Power Decay / Warmup-Stable-Decay (WSD)
Implements Power Decay and Warmup-Stable-Decay (WSD), optimal learning-rate schedules derived from functional scaling laws.
Both schedules are provably optimal learning-rate trajectories under a functional scaling-law analysis of the training loss. Which one is optimal depends on two task exponents: a source exponent \(s\) (smaller means a harder task) and a capacity exponent \(\beta\) (smaller means higher model capacity). In the easy-task regime (\(s \ge 1 - 1/\beta\)) the optimal schedule is a single power decay from a peak rate to zero. In the hard-task regime (\(s < 1 - 1/\beta\)) the optimal schedule is Warmup-Stable-Decay: hold the rate at the maximum stable value, then power-decay to zero over a vanishing terminal fraction of training. Both share the same decay exponent \(2\beta - 1\).
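Written out, a form consistent with this description is sketched below. The power-decay expression follows from the stated peak rate, zero endpoint, and exponent \(2\beta - 1\). The normalization of the WSD decay phase by \(T - T_1\) is an assumption, chosen so that the rate is continuous at \(T_1\) and reaches zero at \(T\). The subscripts PD and WSD are labels introduced here for readability.

\[
\eta_{\text{PD}}(t) = \eta_{\text{peak}} \left(1 - \frac{t}{T}\right)^{2\beta - 1}, \qquad 0 \le t \le T,
\]

\[
\eta_{\text{WSD}}(t) =
\begin{cases}
\eta_{\text{stab}}, & 0 \le t \le T_1, \\
\eta_{\text{stab}} \left(\dfrac{T - t}{T - T_1}\right)^{2\beta - 1}, & T_1 < t \le T,
\end{cases}
\]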
where \(t\) is the training step, \(T\) the total training horizon, \(T_1\) the breakpoint where the decay phase begins, \(\eta_{\text{peak}}\) the peak learning rate, \(\eta_{\text{stab}}\) the maximum stable learning rate, and \(\beta > 1\) the capacity exponent setting the decay power \(2\beta - 1\). The decay fraction \((T - T_1)/T \to 0\) as \(T\) grows.
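The definitions above can be turned into a small, dependency-free sketch. The function names `power_decay_lr` and `wsd_lr` and their parameters are hypothetical and are not this library's API. The WSD decay phase is assumed to be normalized by \(T - T_1\) so that the rate is continuous at the breakpoint and reaches zero at \(T\). No separate warmup ramp is modeled, since the description above only specifies the stable and decay phases.

```python
def power_decay_lr(step: int, total_steps: int, peak_lr: float, beta: float) -> float:
    """Power Decay sketch: peak_lr * (1 - t/T) ** (2*beta - 1)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if beta <= 1.0:
        raise ValueError("beta must be > 1 (the capacity exponent)")
    # Fraction of training remaining, clamped to [0, 1] so that steps
    # past the horizon yield a rate of exactly zero.
    remaining = min(max(1.0 - step / total_steps, 0.0), 1.0)
    return peak_lr * remaining ** (2.0 * beta - 1.0)


def wsd_lr(step: int, total_steps: int, decay_start: int,
           stable_lr: float, beta: float) -> float:
    """WSD sketch: hold stable_lr until decay_start (T_1), then power-decay to 0 at T.

    Assumption: the decay phase is normalized by (T - T_1).
    """
    if not 0 <= decay_start < total_steps:
        raise ValueError("decay_start must satisfy 0 <= decay_start < total_steps")
    if beta <= 1.0:
        raise ValueError("beta must be > 1 (the capacity exponent)")
    if step <= decay_start:
        return stable_lr  # stable phase
    # Fraction of the decay phase remaining, clamped at zero past the horizon.
    remaining = max(total_steps - step, 0) / (total_steps - decay_start)
    return stable_lr * remaining ** (2.0 * beta - 1.0)


if __name__ == "__main__":
    # Illustrative values only: beta = 1.5 gives decay exponent 2*beta - 1 = 2.
    for t in (0, 50, 80, 90, 100):
        print(t, power_decay_lr(t, 100, 0.1, 1.5), wsd_lr(t, 100, 80, 0.1, 1.5))
```

Dividing either function's output by its peak or stable rate gives a multiplicative factor, which is the form expected by multiplier-based scheduler wrappers such as PyTorch's `LambdaLR`.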
Reference: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu, "Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay", arXiv:2602.06797, 2026. https://arxiv.org/abs/2602.06797