DoG¶
Implements DoG, the parameter-free distance-over-gradients step size schedule.
\[
\begin{aligned}
\bar{r}_t &= \max\bigl(\bar{r}_{t-1},\, \lVert \theta_t - \theta_0 \rVert\bigr) \\
G_t &= \sum_{i \le t} \lVert g_i \rVert^2 \\
\eta_t &= \frac{\bar{r}_t}{\sqrt{G_t}} \\
\theta_{t+1} &= \theta_t - \eta_t\, g_t
\end{aligned}
\]
where the initial distance estimate is
\(\bar{r}_0 = r_\epsilon = \alpha\,(1 + \lVert \theta_0 \rVert)\) with
\(\alpha\) given by reps_rel, and lr enters only as a constant
multiplier \(c\) on \(\eta_t\).
Note: Leave lr at its default of 1.0. The paper recommends pairing DoG with polynomial decay iterate averaging, and raising reps_rel to 1e-4 for models that use batch normalization.
Reference: Maor Ivgi, Oliver Hinder, Yair Carmon, "DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule", ICML 2023. https://arxiv.org/abs/2302.12022