1-bit Adam¶
Implements 1-bit Adam, a communication-efficient Adam that freezes the second moment after a warmup phase and transmits an error-compensated 1-bit compression of the momentum.
The paper observes that Adam's second moment \(v_t\) stabilizes early in training, while the nonlinear dependence of the update on \(v_t\) blocks the error-feedback compression that works for plain momentum SGD. 1-bit Adam runs vanilla Adam for a warmup of \(T_w\) steps, then freezes \(v_{T_w}\) and switches to a compression phase: each worker compresses its momentum with a 1-bit (sign-based) operator \(C_\omega\), carries the residual forward as local error feedback \(\delta_t\), and the averaged compressed momenta are compressed again on the server, so only 1-bit messages cross the network while the frozen variance still preconditions the step.
where \(g_t\) is the gradient, \(m_t\) is the local first moment, \(v_{T_w}\) is the second moment frozen at the end of warmup, \(C_\omega[\cdot]\) is the error-compensated 1-bit compression operator, \(\delta_t\) and \(\bar{\delta}_t\) are the worker and server compression-error (feedback) terms, \(\hat{m}_t^{(i)}\) is the compressed momentum from worker \(i\), \(n\) is the number of workers, \(\beta_1, \beta_2 \in [0,1)\) are the decay rates, \(\gamma\) is the learning rate, and \(\epsilon\) is a small constant for numerical stability. During the warmup phase (\(t \le T_w\)) the update reduces to standard Adam with \(v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\).
Reference: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He, "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed", ICML 2021. https://arxiv.org/abs/2102.02888