Apollo¶
Implements Apollo, an adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization.
Apollo approximates the Hessian by a diagonal matrix \(B_t\) that is updated to satisfy a parameter-wise weak secant condition using only first-order gradient information, giving linear time and memory cost. To guarantee a descent direction in the nonconvex setting, the diagonal preconditioner is rectified by taking element-wise absolute values bounded below by a convexity threshold \(\sigma\).
where \(\theta\) are the parameters, \(\eta\) the learning rate, \(g_t\) the gradient, \(m_t\) the bias-corrected exponential moving average of gradients with decay \(\beta\), \(B_t\) the diagonal Hessian approximation, \(d_t\) the update direction, \(\mathrm{Diag}(d_{t-1}^2)\) the diagonal matrix of squared direction entries, \(D_t = \max(\lvert B_t \rvert, \sigma)\) the rectified preconditioner with convexity threshold \(\sigma\), and \(\epsilon\) a stability constant.
Reference: Xuezhe Ma, "Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization", arXiv 2020. https://arxiv.org/abs/2009.13586