8-bit Optimizers¶
Implements 8-bit Optimizers, stateful optimizers (Adam, Momentum) whose moments are stored in 8 bits via block-wise dynamic quantization.
The optimizer state (the running moments \(m_t\) and \(v_t\)) is the dominant memory cost of stateful methods. Here each state tensor is kept in 8 bits and only its non-quantized 32-bit form is materialized transiently during the update. The state tensor is split into contiguous blocks of \(B = 2048\) elements; each block is normalized by its own absolute maximum and mapped to an 8-bit index against a fixed dynamic-quantization codebook \(Q^{\mathrm{map}}\). Block-wise normalization confines outliers to a single block, so it does not degrade the precision of all other blocks. At each step the 8-bit states are dequantized, the standard 32-bit update is performed, and the new states are quantized back to 8 bits.
For the \(b\)-th block with normalization constant \(N_b\), quantization, dequantization, and the (Adam) update applied in dequantized space are
where \(T_b\) is a block of an optimizer state tensor, \(N_b\) its block-wise normalization constant, \(T_{bi}^{Q}\) the stored 8-bit index, \(Q^{\mathrm{map}}\) the dynamic-quantization codebook with \(2^n=256\) entries, \(T_{bi}^{D}\) the dequantized 32-bit value used for the update, \(g_t\) the gradient, \(m_t,v_t\) the dequantized first and second moments, \(\beta_1,\beta_2\) the decay rates, \(\gamma\) the learning rate, and \(\epsilon\) a stability constant; \(m_t\) and \(v_t\) (or, for 8-bit Momentum, the single momentum buffer) are re-quantized to 8 bits after the update.
Reference: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, "8-bit Optimizers via Block-wise Quantization", ICLR 2022. https://arxiv.org/abs/2110.02861