VSGD
Implements VSGD (Variational Stochastic Gradient Descent), a Bayesian optimizer that treats the true gradient as a latent variable inferred from noisy stochastic gradients.
VSGD places a hierarchical Gaussian–Gamma model on the observed gradient \(\hat{g}_t\) and the latent true gradient, then performs coordinate-wise variational inference. The posterior mean \(\mu_{t,g}\) and variance \(\sigma_{t,g}^2\) of the gradient are obtained by a precision-weighted fusion of the running estimate and the new observation, where the precisions are themselves estimated online through Gamma rate parameters \(b_{t,g}\) (state precision) and \(b_{t,\hat{g}}\) (observation precision). The parameter step rescales the inferred mean by the inverse root second moment, recovering an Adam-like denominator from first principles.
In this notation, \(\hat{g}_t\) is the stochastic gradient, and \(\mu_{t,g}\) and \(\sigma_{t,g}^2\) are the variational posterior mean and variance of the true gradient. The Gamma shape parameters are fixed at \(a_{t,g} = a_{t,\hat{g}} = \gamma + \tfrac{1}{2}\), while \(b_{t,g}\) and \(b_{t,\hat{g}}\) are the Gamma rate parameters for the state and observation precisions. \(\gamma\) is the prior rate and \(K_g\) is a prior scaling constant. The step sizes \(\rho_{t,1} = t^{-\kappa_1}\) and \(\rho_{t,2} = t^{-\kappa_2}\) decay with the iteration count \(t\); choosing \(\kappa_1, \kappa_2 \in (0.5, 1]\) makes them satisfy the Robbins–Monro conditions. \(\eta\) is the learning rate.
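The quantities above can be tied together in a minimal NumPy sketch of one coordinate-wise step. It is an illustration under stated assumptions, not this library's implementation: the function names `init_state` and `vsgd_step` are hypothetical, the hyperparameter values are placeholders, and the exact form of the rate updates (including initializing the observation rate to \(K_g \gamma\)) reflects one reading of the referenced paper and should be checked against it.

```python
import numpy as np

def init_state(shape, gamma=1e-8, K_g=30.0):
    """Hypothetical helper: initial (mu, b_g, b_ghat). Assumes b_ghat starts at K_g * gamma."""
    return np.zeros(shape), np.full(shape, gamma), np.full(shape, K_g * gamma)

def vsgd_step(theta, grad, state, t, lr=0.1, gamma=1e-8, K_g=30.0,
              kappa1=0.81, kappa2=0.9):
    """One illustrative VSGD-style update; t is the 1-based iteration count."""
    mu_prev, b_g, b_ghat = state
    a = gamma + 0.5  # fixed Gamma shape shared by both precisions

    # Precision-weighted fusion: the expected precisions are a / b, and the
    # shared shape a cancels in the weights of the posterior mean.
    mu = (b_ghat * mu_prev + b_g * grad) / (b_g + b_ghat)
    sigma2 = 1.0 / (a / b_g + a / b_ghat)

    # Local estimates of the Gamma rates (assumed form).
    b_g_local = gamma + 0.5 * (sigma2 + (mu - mu_prev) ** 2)
    b_ghat_local = K_g * gamma + 0.5 * (sigma2 + (grad - mu) ** 2)

    # Decaying step sizes rho_{t,1} and rho_{t,2} blend old and local rates.
    rho1, rho2 = t ** (-kappa1), t ** (-kappa2)
    b_g = (1.0 - rho1) * b_g + rho1 * b_g_local
    b_ghat = (1.0 - rho2) * b_ghat + rho2 * b_ghat_local

    # Rescale the inferred mean by the inverse root second moment.
    theta = theta - lr * mu / np.sqrt(mu ** 2 + sigma2)
    return theta, (mu, b_g, b_ghat)

# Usage: noisy gradients of f(theta) = 0.5 * ||theta||^2.
rng = np.random.default_rng(0)
theta = np.array([3.0, -2.0])
state = init_state(theta.shape)
for t in range(1, 201):
    grad = theta + 0.1 * rng.standard_normal(theta.shape)
    theta, state = vsgd_step(theta, grad, state, t)
```

Because \(|\mu_{t,g}| \le \sqrt{\mu_{t,g}^2 + \sigma_{t,g}^2}\), each coordinate moves by at most \(\eta\) per step in this sketch, which is the Adam-like normalization described above.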
Reference: Haotian Chen, Anna Kuzina, Babak Esmaeili, Jakub M. Tomczak, "Variational Stochastic Gradient Descent for Deep Neural Networks", ICML 2024. https://arxiv.org/abs/2404.06549