APMSqueeze¶
Implements APMSqueeze, a communication-efficient Adam-preconditioned momentum SGD with error-compensated 1-bit gradient compression.
Adam's per-coordinate variance preconditioner is nonlinear, which breaks the error-compensation analysis that makes 1-bit compression work for plain SGD. APMSqueeze sidesteps this in two phases. In a short warmup it runs ordinary Adam for \(T_w\) steps to learn a good second-moment estimate, then freezes the variance at \(v_{T_w}\). In the second phase it switches to momentum SGD preconditioned by the frozen \(v_{T_w}\), so the only quantity that has to be communicated is the momentum, which can now be safely 1-bit compressed with local and global error feedback.
For \(t \le T_w\) each worker runs standard Adam (updating both \(m_t\) and \(v_t\)). For \(t > T_w\) the variance is held fixed and each worker \(i\) performs:
where \(\theta\) are the parameters, \(\gamma_t\) the learning rate, \(g_t^{(i)}\) the local gradient on worker \(i\) (of \(n\) workers), \(m_t\) the momentum with decay \(\beta_1\), \(v_{T_w}\) the second-moment estimate frozen after \(T_w\) Adam warmup steps, \(C_\omega[\cdot]\) the 1-bit compression operator, \(\delta_t^{(i)}\) and \(\bar{\delta}_t\) the local and global accumulated compression-error terms, and \(\oslash\) element-wise division.
Reference: Hanlin Tang, Shaoduo Gan, Samyam Rajbhandari, Xiangru Lian, Ji Liu, Yuxiong He, Ce Zhang, "APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm", arXiv 2020. https://arxiv.org/abs/2008.11343