Skip to content

DAdaptSGD

Implements SGD with D-Adaptation automatic step sizes.

\[ \begin{aligned} \lambda_t &= \frac{d_t \gamma}{\lVert g_0 \rVert} \\ s_{t+1} &= s_t + \lambda_t g_t \\ z_{t+1} &= z_t - \lambda_t g_t \\ \theta_{t+1} &= \beta \theta_t + (1 - \beta) z_{t+1} \\ \hat{d}_{t+1} &= \frac{2 \sum_{i=0}^{t} \lambda_i \langle g_i, s_i \rangle}{\lVert s_{t+1} \rVert} \\ d_{t+1} &= \max(d_t, \hat{d}_{t+1}) \end{aligned} \]

where \(\gamma\) is lr and \(\beta\) is momentum.

Note: lr rescales the D-adapted step size and should normally stay

at its default of 1.0.

Reference: Aaron Defazio and Konstantin Mishchenko, "Learning-Rate-Free Learning by D-Adaptation", ICML 2023. https://arxiv.org/abs/2301.07733


Back to the Canon