RSGDM¶
Implements RSGDM, SGD with momentum corrected by a differential (gradient-change) estimate.
SGDM estimates the overall gradient with an exponential moving average of \(g_t\), which is biased and lags behind the true gradient. RSGDM additionally tracks the exponential moving average of the gradient difference \(\Delta g_t = g_t - g_{t-1}\) and adds it, scaled by \(\beta\), to the usual momentum. This differential term corrects the bias and reduces the lag without introducing any new hyperparameter.
where \(\theta\) are the parameters, \(\eta\) is the learning rate, \(g_t\) is the gradient, \(\beta\) is the momentum decay, \(m_t\) is the gradient EMA, \(z_t\) is the EMA of the gradient difference \(\Delta g_t\), and \(n_t\) is the differentially corrected gradient estimate used in the update.
Reference: Honglin Qin, Hongye Zheng, Bingxing Wang, Zhizhong Wu, Bingyao Liu, Yuanfang Yang, "Reducing Bias in Deep Learning Optimization: The RSGDM Approach", arXiv 2024. https://arxiv.org/abs/2409.15314