EMA bias-corrected iterate averaging¶
Implements BEMA, a bias-corrected exponential moving average of the iterate trajectory.
Plain EMA of the parameters smooths the optimization trajectory but lags behind the current iterate, since recent updates are underweighted. BEMA adds a debiasing term proportional to the displacement \(\theta_t - \theta_0\) from a periodically refreshed snapshot, which cancels the lag of the EMA estimate. Both the EMA decay \(\beta_t\) and the correction weight \(\alpha_t\) anneal with the step count, mirroring the maximum-likelihood estimator in the quadratic setting. The snapshot \(\theta_0\) and both averages are reset throughout a burn-in of \(\tau\) steps, after which BEMA is refreshed every \(\phi\) steps.
where \(\theta_t\) is the optimizer's iterate at step \(t\), \(\theta_0\) is the most recent snapshot (refreshed during burn-in), \(\hat{\mu}^{\mathrm{EMA}}_t\) is the running EMA, \(\hat{\mu}_t\) is the bias-corrected average returned, \(\kappa\) is the EMA power, \(\eta\) is the bias power, \(\gamma\) is the multiplier, and \(\rho\) is the lag offset; defaults are \(\kappa = 0.5\), \(\eta = 0.2\), \(\gamma = 1.0\), \(\rho = 10\).
Reference: Adam Block, Cyril Zhang, "EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes", arXiv 2025. https://arxiv.org/abs/2508.00180