StoSignSGD
Implements StoSignSGD, a sign-based optimizer that injects unbiased structural stochasticity into the sign operator.
SignSGD discards gradient magnitude and is known to diverge on non-smooth objectives. StoSignSGD replaces the deterministic sign with a stochastic sign operator: each coordinate is perturbed by uniform noise scaled by a per-coordinate buffer before taking the sign, so that in expectation the update recovers an anisotropic, magnitude-aware (preconditioned SGD) direction while still transmitting only one bit per coordinate. The buffer \(G_t\) tracks the running coordinate-wise maximum of the (momentum-smoothed) gradient magnitudes, setting the normalization level. The practical implementation adds heavy-ball momentum and decoupled weight decay.
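The step described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function name `stosign_step` and its arguments are hypothetical, and the exponential-moving-average form of the momentum, the noise range \([-1, 1]\), and the AdamW-style form of the decoupled weight decay are all assumptions.

```python
import random


def stosign_step(theta, grad, m, G, lr=1e-3, beta1=0.9, weight_decay=0.0, rng=random):
    """One illustrative StoSignSGD-style step on plain Python lists.

    Hypothetical helper: the exact momentum form, noise range, and weight
    decay form are assumptions, not taken from a reference implementation.
    """
    new_theta, new_m, new_G = [], [], []
    for th, g, m_i, G_i in zip(theta, grad, m, G):
        # Momentum smoothing (assumed exponential-moving-average form).
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Running coordinate-wise maximum of the smoothed gradient magnitude.
        G_i = max(G_i, abs(m_i))
        # Independent uniform noise per coordinate (assumed range [-1, 1]),
        # scaled by the buffer before the sign is taken.
        z = m_i + G_i * rng.uniform(-1.0, 1.0)
        s = (z > 0) - (z < 0)  # sign in {-1, 0, +1}
        # Decoupled weight decay (assumed AdamW-style), then the sign step.
        th = th * (1.0 - lr * weight_decay) - lr * s
        new_theta.append(th)
        new_m.append(m_i)
        new_G.append(G_i)
    return new_theta, new_m, new_G
```

Each coordinate moves by at most `lr` per step regardless of gradient scale. The magnitude information enters only through the probability that the noisy sign agrees with the sign of the momentum.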
In the update rule, \(\theta\) denotes the parameters, \(g_t\) the stochastic gradient, \(m_t\) the momentum buffer with decay \(\beta_1\), \(G_t\) the coordinate-wise max buffer, \(n_t\) uniform noise drawn independently per coordinate, \(\odot\) the elementwise product, \(\eta_t\) the learning rate, and \(\lambda\) the weight decay coefficient. The stochastic sign operator is \(\mathrm{sign}(m_t + G_t \odot n_t)\), which is unbiased after coordinate-wise rescaling: \(\mathbb{E}[\mathrm{sign}(x + G \odot n)] = x / G\), with the division taken elementwise. The identity holds when each coordinate of \(n\) is uniform on \([-1, 1]\) and \(|x| \le G\) in every coordinate.
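The unbiasedness identity can be checked numerically for a single coordinate. Assuming \(n\) is uniform on \([-1, 1]\) and \(|x| \le G\), the probability that \(x + G n > 0\) is \((1 + x/G)/2\), so the expected sign is \(x/G\). The sketch below estimates that expectation by Monte Carlo; the helper name `stochastic_sign_mean` and the sample values are illustrative only.

```python
import random


def stochastic_sign_mean(x, G, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[sign(x + G * n)] with n ~ Uniform[-1, 1].

    Hypothetical helper for illustration; the noise range is an assumption.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        z = x + G * rng.uniform(-1.0, 1.0)
        total += (z > 0) - (z < 0)  # sign in {-1, 0, +1}
    return total / num_samples


# With x = 0.3 and G = 1.5 the estimate should land near x / G = 0.2,
# up to Monte Carlo error.
estimate = stochastic_sign_mean(x=0.3, G=1.5)
print(estimate)
```

When \(|x| > G\) the noise can no longer flip the sign, so the expectation saturates at \(\pm 1\). This is why the buffer needs to upper-bound the momentum magnitude.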
Reference: Dingzhi Yu, Rui Pan, Yuxing Liu, Tong Zhang, "StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models", arXiv 2026. https://arxiv.org/abs/2604.15416