APOLLO
Implements APOLLO, a memory-efficient AdamW variant that replaces element-wise gradient scaling with channel-wise or tensor-wise factors estimated in a low-rank space under random projection.
The projection \(P \in \mathbb{R}^{r \times m}\) is resampled
every update_proj_gap steps, and \(\alpha\) is the scale argument. A
norm-growth limiter caps the ratio of consecutive scaled-gradient norms
at \(\gamma = 1.01\). scale_type='channel' gives APOLLO;
scale_type='tensor' with a small rank (1 in the paper) gives
APOLLO-Mini. With rank=None no projection is applied and the update
reduces to AdamW.
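The norm-growth limiter described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name limit_norm_growth, the flat-list representation of the update, the eps guard, and the choice to store the post-limiting norm for the next step are all assumptions made for the example.

```python
import math

GAMMA = 1.01  # growth cap quoted in the text


def limit_norm_growth(update, prev_norm, gamma=GAMMA, eps=1e-8):
    """Rescale `update` so its norm is at most about gamma * prev_norm.

    `update` is a flat list of floats (hypothetical representation).
    `prev_norm` is the norm kept from the previous step, or None on
    the first step. Returns (limited_update, norm_to_store).
    """
    norm = math.sqrt(sum(x * x for x in update))
    if prev_norm is None:
        # Nothing to compare against yet: pass the update through.
        return list(update), norm
    # If norm / prev_norm exceeds gamma, limiter > 1 and the update is
    # shrunk; otherwise limiter == 1 and the update is unchanged.
    limiter = max(norm / (prev_norm + eps), gamma) / gamma
    return [x / limiter for x in update], norm / limiter
```

With prev_norm = 1.0, an update of norm 5 would be shrunk to a norm of roughly 1.01, while an update of norm 1.0 would pass through unchanged.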
Note: Following the upstream reimplementation, the scaling factors are applied to the projected gradient \(R_t\) and the update is mapped back through \(P^\top\) scaled by \(\alpha^{3/2}\), rather than scaling the full-rank gradient \(G_t\) directly as in Algorithm 1 of the paper.
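To make the distinction in the note concrete, here is a rough NumPy sketch of both update directions. It is one reading of the description above, not the implementation itself: the names scale_factors and apollo_directions are hypothetical, bias correction, weight decay, and the norm-growth limiter are omitted, and the gradient \(G_t\) is assumed to have shape \(m \times n\) so that \(R_t = P G_t\) has shape \(r \times n\).

```python
import numpy as np


def scale_factors(R, R_tilde, scale_type="channel", eps=1e-8):
    """Ratio of Adam-normalized to raw projected-gradient norms."""
    if scale_type == "channel":
        # One factor per column (channel) of the r x n projected gradient.
        return np.linalg.norm(R_tilde, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    # Tensor-wise: a single scalar for the whole matrix.
    return np.linalg.norm(R_tilde) / (np.linalg.norm(R) + eps)


def apollo_directions(G, P, M, V, alpha=1.0, beta1=0.9, beta2=0.999,
                      eps=1e-8, scale_type="channel"):
    """Return (paper_style, projected_style, new_M, new_V).

    G: (m, n) gradient, P: (r, m) random projection,
    M, V: (r, n) Adam moments kept in the low-rank space.
    """
    R = P @ G                                # project to rank r
    M = beta1 * M + (1.0 - beta1) * R        # first moment (no bias correction)
    V = beta2 * V + (1.0 - beta2) * R ** 2   # second moment
    R_tilde = M / (np.sqrt(V) + eps)         # Adam-normalized projected gradient
    s = scale_factors(R, R_tilde, scale_type)
    # Algorithm 1 style: scale the full-rank gradient directly.
    paper_style = alpha * G * s
    # Variant described in the note (assumed form): scale R, then map
    # back through P^T with an alpha^(3/2) factor.
    projected_style = alpha ** 1.5 * (P.T @ (R * s))
    return paper_style, projected_style, M, V
```

Both directions have the same shape as the gradient. In this sketch they differ in that the second passes through \(P^\top P\), so it lies in the row space of \(P\) rather than being a pure rescaling of \(G_t\).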
Reference: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee, "APOLLO: SGD-like Memory, AdamW-level Performance", MLSys 2025. https://arxiv.org/abs/2412.05270