DyKAF¶
Implements DyKAF, an Adam-style update preconditioned by a dynamical Kronecker approximation of the Fisher information matrix.
DyKAF treats a matrix parameter \(W \in \mathbb{R}^{m \times n}\) and approximates the empirical Fisher \(F_t = \sum_{i \le t} \mathrm{vec}(G_i)\,\mathrm{vec}(G_i)^\top\) by a Kronecker product \(F_t \approx L_t \otimes R_t\), with the factors \(L_t \in \mathbb{R}^{m \times m}\) and \(R_t \in \mathbb{R}^{n \times n}\) tracked over time by a low-rank projector-splitting integrator. Diagonalizing the factors as \(L_t = Q_L \Lambda_L Q_L^\top\) and \(R_t = Q_R \Lambda_R Q_R^\top\) rotates the gradient into the Fisher eigenbasis, where preconditioning reduces to a diagonal (Adam-like) rescaling before rotating back.
Each step rotates the momentum, divides it by a square-rooted second moment accumulated in the rotated basis, rotates back, and applies the step; the Kronecker factors and their eigenvectors \(Q_L, Q_R\) are refreshed periodically.
where \(\theta = W\) are the parameters, \(\eta\) the learning rate, \(g_t\) the gradient, \(m_t\) the momentum, \(v_t\) the second moment formed in the rotated basis, \(\beta_1, \beta_2\) the decay rates, \(\epsilon\) the stability constant, \(Q_L, Q_R\) the eigenvector matrices of the Kronecker factors \(L_t, R_t\), and \(\odot, \oslash, (\cdot)^{\circ 1/2}\) elementwise product, division, and square root.
Reference: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba, "DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning", arXiv 2025. https://arxiv.org/abs/2511.06477