Magma
Implements Magma, a drop-in wrapper that masks an adaptive optimizer's updates and modulates them by momentum-gradient alignment.
Magma builds on the observation that randomly masking parameter updates acts as a curvature-dependent regularizer that smooths the optimization trajectory. Rather than masking uniformly, Magma scales the surviving updates by how well the current stochastic gradient agrees with the accumulated momentum: a high cosine similarity keeps the update near full strength, while a poorly aligned gradient is suppressed. Parameters are partitioned into disjoint blocks \(b\), and a Bernoulli mask plus an alignment score are applied per block on top of the update \(\Delta_t^{(b)}\) produced by a base optimizer (Adam, RMSProp, LaProp, or Muon), whose moments are always updated densely.
Notation: \(\theta^{(b)}\) denotes the parameters of block \(b\), \(g_t^{(b)}\) the stochastic gradient, \(m_t^{(b)}\) the first-moment (momentum) estimate, \(\Delta_t^{(b)}\) the update direction from the base optimizer, \(\mathrm{cossim}\) the cosine similarity, \(\tau>0\) a temperature (\(\tau=2\) in experiments), \(s_t^{(b)}\) the EMA-smoothed alignment score, and \(z_t^{(b)}\) an independent Bernoulli\((0.5)\) mask drawn per block at each step.
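To make the per-block mechanics concrete, the following is a minimal NumPy sketch of one masked, alignment-scaled step for a single block. It is an illustration under stated assumptions, not this library's implementation: the function names `cosine_similarity` and `magma_block_step` are hypothetical, the score is assumed to be a sigmoid of \(\mathrm{cossim}(m_t^{(b)}, g_t^{(b)})/\tau\), and the EMA coefficient `ema=0.9` and learning rate `lr` are placeholder values. The base optimizer is assumed to have already updated its moments densely and produced `delta` before this function is called.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two arrays, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def magma_block_step(theta, grad, momentum, delta, score, rng,
                     lr=1e-3, tau=2.0, ema=0.9, keep_prob=0.5):
    """One masked, alignment-scaled step for a single parameter block.

    `delta` is the base optimizer's update direction for this block.
    Returns the new parameters, the new smoothed score, and the mask draw.
    """
    # Assumption: squash the temperature-scaled alignment into (0, 1).
    raw = 1.0 / (1.0 + np.exp(-cosine_similarity(momentum, grad) / tau))
    # Assumption: exponential moving average of the alignment score.
    score = ema * score + (1.0 - ema) * raw
    # Independent Bernoulli(keep_prob) mask, one draw for the whole block.
    z = float(rng.random() < keep_prob)
    # A masked block (z == 0) keeps its parameters; otherwise the base
    # update is scaled by the smoothed alignment score.
    theta = theta - lr * score * z * delta
    return theta, score, z
```

In a full optimizer this function would be called once per block per step, with `score` carried as per-block state. Drawing the mask per block rather than per element keeps the cosine similarity and the mask defined over the same group of parameters.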
Reference: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie, "On Surprising Effectiveness of Masking Updates in Adaptive Optimizers", arXiv 2026. https://arxiv.org/abs/2602.15322