
Cayley SGD

Implements Cayley SGD, momentum SGD constrained to the Stiefel manifold via the Cayley transform.

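Cayley SGD optimizes a parameter matrix \(X\) whose columns must stay orthonormal (\(X^\top X = I\)). It accumulates momentum in Euclidean space, projects it onto the tangent space of the manifold as a skew-symmetric matrix \(W\), and moves along the resulting curve using the Cayley transform \(Y(\alpha) = (I - \tfrac{\alpha}{2}W)^{-1}(I + \tfrac{\alpha}{2}W)X\), which preserves orthonormality exactly. To avoid the matrix inverse, the method uses the equivalent implicit form \(Y = X + \tfrac{\alpha}{2} W (X + Y)\) and approximates its solution with a few fixed-point iterations, so the result is orthonormal up to the iteration error. The step size is capped at \(2q / (\lVert W \rVert + \epsilon)\), which keeps \(\tfrac{\alpha}{2}\lVert W \rVert \le q\); for \(q < 1\) the iteration is a contraction and converges to \(Y(\alpha)\).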
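As a rough check of this description, the NumPy sketch below compares the exact Cayley transform (computed with a linear solve) against the fixed-point iteration \(Y \leftarrow X + \tfrac{\alpha}{2} W (X + Y)\). The function names, the matrix sizes, the random \(W\), and the use of the Frobenius norm to pick \(\alpha\) are illustrative assumptions, not part of the reference implementation.

```python
import numpy as np

def cayley_exact(X, W, alpha):
    """Exact Cayley transform Y = (I - a/2 W)^{-1} (I + a/2 W) X via a linear solve."""
    I = np.eye(W.shape[0])
    return np.linalg.solve(I - 0.5 * alpha * W, (I + 0.5 * alpha * W) @ X)

def cayley_fixed_point(X, W, alpha, s=2):
    """Approximate the transform with s fixed-point iterations (no inverse)."""
    Y = X + alpha * (W @ X)                 # first-order initial guess
    for _ in range(s):
        Y = X + 0.5 * alpha * W @ (X + Y)
    return Y

rng = np.random.default_rng(0)
n, p = 6, 3                                  # illustrative sizes
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
A = rng.standard_normal((n, n))
W = A - A.T                                  # any skew-symmetric matrix
alpha = 0.5 / np.linalg.norm(W)              # assumption: Frobenius norm, small step

Y_exact = cayley_exact(X, W, alpha)
Y_iter = cayley_fixed_point(X, W, alpha, s=2)

orth_err_exact = np.linalg.norm(Y_exact.T @ Y_exact - np.eye(p))
orth_err_iter = np.linalg.norm(Y_iter.T @ Y_iter - np.eye(p))
gap = np.linalg.norm(Y_exact - Y_iter)
print(f"exact: {orth_err_exact:.2e}  iterated: {orth_err_iter:.2e}  gap: {gap:.2e}")
```

The exact transform should stay orthonormal to floating-point precision, while the iterated result carries a small error that shrinks as `s` grows or `alpha` decreases.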

\[ \begin{aligned} m_{t+1} &\leftarrow \beta m_t - g_t \\ \hat{W}_t &= m_{t+1} X_t^\top - \tfrac{1}{2} X_t \left( X_t^\top m_{t+1} X_t^\top \right) \\ W_t &= \hat{W}_t - \hat{W}_t^\top \\ m_{t+1} &\leftarrow W_t X_t \\ \alpha &= \min\left\{ \eta,\ \frac{2q}{\lVert W_t \rVert + \epsilon} \right\} \\ Y^{0} &= X_t + \alpha\, m_{t+1} \\ Y^{i} &= X_t + \tfrac{\alpha}{2} W_t \left( X_t + Y^{i-1} \right), \quad i = 1,\dots,s \\ X_{t+1} &= Y^{s} \end{aligned} \]

where \(X_t\) is the orthonormal parameter matrix, \(g_t = G(X_t)\) is the Euclidean gradient, \(m_t\) is the momentum, \(\beta\) the momentum coefficient, \(W_t\) the skew-symmetric tangent direction, \(\eta\) the base learning rate, \(q\) a step-size constant (default \(0.5\)), \(s\) the number of fixed-point iterations (default \(2\)), and \(\epsilon\) a small constant for stability.
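The update equations can be sketched as a single NumPy function. This is a minimal illustration under stated assumptions, not the reference implementation: the name `cayley_sgd_step`, the defaults for `lr`, `beta`, and `eps`, and the choice of the Frobenius norm for \(\lVert W_t \rVert\) are hypothetical; only `q=0.5` and `s=2` come from the defaults listed above. The toy objective \(f(X) = -\tfrac{1}{2}\operatorname{tr}(X^\top A X)\) is likewise only for demonstration.

```python
import numpy as np

def cayley_sgd_step(X, m, grad, lr=0.1, beta=0.9, q=0.5, s=2, eps=1e-8):
    """One Cayley SGD update.

    X: (n, p) matrix with orthonormal columns; m: momentum, same shape;
    grad: Euclidean gradient at X. Returns the new X and the new momentum.
    """
    m = beta * m - grad                                # Euclidean momentum
    W_hat = m @ X.T - 0.5 * X @ (X.T @ m @ X.T)
    W = W_hat - W_hat.T                                # skew-symmetric direction
    m = W @ X                                          # momentum projected to the tangent space
    alpha = min(lr, 2.0 * q / (np.linalg.norm(W) + eps))  # assumption: Frobenius norm
    Y = X + alpha * m                                  # Y^0
    for _ in range(s):                                 # Y^1 ... Y^s
        Y = X + 0.5 * alpha * W @ (X + Y)
    return Y, m

# Toy problem (hypothetical): push X toward a dominant 2-D subspace of a symmetric A.
rng = np.random.default_rng(0)
n, p = 8, 2
B = rng.standard_normal((n, n))
A = B @ B.T
A /= np.linalg.norm(A, 2)                              # scale to spectral norm 1
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
m = np.zeros_like(X)

for _ in range(50):
    grad = -A @ X                                      # gradient of -0.5 * trace(X^T A X)
    X, m = cayley_sgd_step(X, m, grad, lr=0.1, beta=0.9)

orth_err = np.linalg.norm(X.T @ X - np.eye(p))
print(f"orthonormality error after 50 steps: {orth_err:.2e}")
```

Because the fixed-point iteration is truncated after `s` steps, the orthonormality error is small but not zero and can accumulate over many updates; increasing `s` or lowering `lr` should reduce it.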

Reference: Jun Li, Li Fuxin, Sinisa Todorovic, "Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform", ICLR 2020. https://arxiv.org/abs/2002.01113

