SCALE¶
Implements SCALE (Stochastic Column-normAlized Last-layer momEntum), a minimalist memory-efficient optimizer for LLM pretraining.
SCALE strips Adam down to two ingredients. It drops second-order moments entirely and keeps first-order momentum only for the last layer, while every other layer uses the raw stochastic gradient. The update direction for each weight matrix is then column-normalized: each column is divided by its own L2 norm. This corresponds to steepest descent under the \(\|\cdot\|_{1\to 2}\) operator norm, which removes the per-parameter adaptive state of Adam yet preserves most of its conditioning benefit at near-SGD memory cost.
For a layer \(l\) with weight matrix \(\theta_l\) and gradient \(g_t\), the update is
where \(\theta_t\) are the layer parameters, \(\eta\) is the learning rate, \(g_t\) the stochastic gradient, \(m_t\) the first-order momentum, \(\beta\) the momentum decay, \(L\) the index of the last layer, and \(C(\cdot)\) the column-wise normalization that divides each column \(M_{:,j}\) by its Euclidean norm.
Reference: Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong, "A Minimalist Optimizer Design for LLM Pretraining", arXiv 2025. https://arxiv.org/abs/2506.16659