AdamP
Implements AdamP, Adam with a scale-invariant projection step.
For each layer-weight parameter, the Adam update \(p_t\) is split into its radial and tangential components relative to the weight \(\theta\). The radial part is removed whenever the cosine similarity between the gradient and the weight falls below a threshold, in which case the weight is treated as scale-invariant. The removal uses the projection \(\Pi_{\theta}(p) = p - (\hat{\theta} \cdot p)\,\hat{\theta}\), which projects out the component of \(p\) along the unit weight \(\hat{\theta} = \theta / \lVert\theta\rVert\). wd_ratio scales the decoupled weight decay on the projected parameters.
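The projection step described above can be sketched in NumPy as follows. This is a minimal illustration for a single flat weight vector, not this library's implementation. The function names (project, cosine, adamp_step), the bias-corrected Adam update, the threshold form delta / sqrt(dim), the use of the absolute cosine, and all default values are assumptions made for the example; per-channel handling of multi-dimensional weights is omitted.

```python
import numpy as np


def project(theta, p, eps=1e-8):
    """Pi_theta(p): remove the component of p along the unit weight."""
    theta_hat = theta / (np.linalg.norm(theta) + eps)
    return p - np.dot(theta_hat, p) * theta_hat


def cosine(a, b, eps=1e-8):
    """Absolute cosine similarity (taking the absolute value is an assumption)."""
    return abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def adamp_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0, delta=0.1, wd_ratio=0.1):
    """One hypothetical AdamP step on a flat weight vector (t starts at 1)."""
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = m_hat / (np.sqrt(v_hat) + eps)

    # Treat the weight as scale-invariant when the gradient is nearly
    # orthogonal to it; the threshold form is assumed for this sketch.
    ratio = 1.0
    if cosine(theta, grad) < delta / np.sqrt(theta.size):
        p = project(theta, p)   # drop the radial part of the update
        ratio = wd_ratio        # scale weight decay on projected parameters

    # Decoupled weight decay followed by the (possibly projected) update.
    theta = theta * (1 - lr * weight_decay * ratio) - lr * p
    return theta, m, v


theta = np.full(4, 0.5)
grad = np.array([1.0, -1.0, 0.01, 0.0])   # nearly orthogonal to theta
new_theta, m, v = adamp_step(theta, grad, np.zeros(4), np.zeros(4), t=1)
```

In this example the gradient is almost orthogonal to the weight, so the projection branch is taken and the step new_theta - theta has essentially no component along theta. The weight norm therefore grows only at second order in the learning rate.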
Reference: Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha, "AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights", ICLR 2021. https://arxiv.org/abs/2006.08217