CaAdam
Implements CaAdam, a connection-aware variant of Adam that scales the learning rate per layer using architectural information.
CaAdam keeps the standard Adam moment estimates and bias correction, but multiplies the step by a per-layer scaling factor \(S\) derived from the network's structure rather than from the gradient statistics. The intuition is that layers differ in their number of connections (or their depth), so a single global learning rate is suboptimal; the scaling acts as a structural prior on the effective step size. Three scaling schemes are proposed: an additive and a multiplicative scheme centered on the median connection count, and a depth-based scheme.
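As a reading aid, the update described above can be written as the usual Adam recursion with the step multiplied by \(S\). The symbols \(g_t\), \(m_t\), \(v_t\), \(\beta_1\), \(\beta_2\), \(\alpha\), \(\epsilon\), and \(\theta_t\) are the conventional Adam notation and are assumed here, not taken from this page:

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
\theta_t &= \theta_{t-1} - \alpha\, S\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}
\]

Here \(g_t\) is the gradient at step \(t\), \(m_t\) and \(v_t\) are the first and second moment estimates, \(\alpha\) is the base learning rate, and \(S\) is constant for all parameters within a layer.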
In the three scaling schemes, \(c\) is the number of connections of the layer a parameter belongs to; \(\tilde{c}\), \(c_{\min}\), and \(c_{\max}\) are the median, minimum, and maximum connection counts across layers; \(d\) is the depth of the current layer and \(d_m\) the total network depth; \(\gamma\) is the scaling strength (default \(0.95\)); and \(S\) is whichever of \(S_{\mathrm{add}}\), \(S_{\mathrm{mul}}\), \(S_{\mathrm{depth}}\) is selected.
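The sketch below illustrates how the three scaling factors could be computed from these quantities. It is one plausible reading of the schemes, not the library's implementation: the exact piecewise formulas, the zero-based depth index, the zero-denominator guard, and every function name (`additive_scale`, `multiplicative_scale`, `depth_scale`, `layer_scales`) are assumptions and should be checked against the reference.

```python
from statistics import median


def _safe_ratio(num, den):
    """Return num / den, or 0.0 when den is zero (e.g. all layers have equal size)."""
    return num / den if den != 0 else 0.0


def additive_scale(c, c_min, c_med, c_max, gamma=0.95):
    """Assumed additive scheme: S in [1 - gamma, 1 + gamma], equal to 1 at the median."""
    if c <= c_med:
        return 1.0 + gamma * _safe_ratio(c_med - c, c_med - c_min)
    return 1.0 - gamma * _safe_ratio(c - c_med, c_max - c_med)


def multiplicative_scale(c, c_min, c_med, c_max, gamma=0.95):
    """Assumed multiplicative scheme: S = (1 + gamma) ** sigma with sigma in [-1, 1]."""
    if c <= c_med:
        sigma = _safe_ratio(c_med - c, c_med - c_min)
    else:
        sigma = -_safe_ratio(c - c_med, c_max - c_med)
    return (1.0 + gamma) ** sigma


def depth_scale(d, d_m, gamma=0.95):
    """Assumed depth scheme with zero-based depth d in [0, d_m - 1]."""
    return (1.0 + gamma) ** ((d_m - (1 + d)) / d_m)


def layer_scales(connections, gamma=0.95, scheme="additive"):
    """Compute one scaling factor per layer from a list of connection counts."""
    c_min, c_med, c_max = min(connections), median(connections), max(connections)
    fn = additive_scale if scheme == "additive" else multiplicative_scale
    return [fn(c, c_min, c_med, c_max, gamma) for c in connections]


if __name__ == "__main__":
    # Hypothetical three-layer network with 10, 100, and 1000 connections.
    print(layer_scales([10, 100, 1000], scheme="additive"))
    print(layer_scales([10, 100, 1000], scheme="multiplicative"))
    print([depth_scale(d, 4) for d in range(4)])
```

Under these assumptions, a layer at the median connection count keeps the base learning rate, layers with fewer connections take larger steps, and layers with more connections take smaller ones. In the depth-based variant the factor shrinks toward 1 as the layer index approaches the last layer.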
Reference: Rémi Genet, Hugo Inzirillo, "CaAdam: Improving Adam optimizer using connection aware methods", arXiv 2024. https://arxiv.org/abs/2410.24216