FUSE
Implements FUSE, a unified synthesis of a first-order Adam step and a second-order L-BFGS step.
FUSE runs both an Adam update and an L-BFGS quasi-Newton update at each iteration and blends them by a convex weight \(\theta \in [0,1]\): \(\theta = 1\) recovers pure Adam, \(\theta = 0\) pure L-BFGS, and intermediate values mix the cheap first-order direction with the curvature-aware second-order direction. The L-BFGS search direction \(p_k\) comes from the standard two-loop recursion over a history of \((s_i, y_i)\) pairs, and its step size \(\alpha_k\) is chosen by a Wolfe line search. A practical variant (FUSE-PV) hard-switches between the two (\(\theta \in \{0,1\}\)) once a switchover criterion on the gradient norm or loss change is met.
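The update equations themselves do not appear in this section. The display below is a sketch reconstructed from the description above and the symbol definitions that accompany it. It assumes the standard Adam moment recursions without bias correction, so check it against the reference before relying on it.

\[
\begin{aligned}
m_k &= \beta_1 m_{k-1} + (1-\beta_1)\, g(x_k), \\
v_k &= \beta_2 v_{k-1} + (1-\beta_2)\, g(x_k) \odot g(x_k), \\
x^A_{k+1} &= x_k - \alpha\, m_k \oslash \left(\sqrt{v_k} + a\right), \\
x^L_{k+1} &= x_k + \alpha_k\, p_k, \\
x_{k+1} &= \theta\, x^A_{k+1} + (1-\theta)\, x^L_{k+1}.
\end{aligned}
\]

With \(\theta = 1\) the last line reduces to \(x_{k+1} = x^A_{k+1}\) (pure Adam), and with \(\theta = 0\) to \(x_{k+1} = x^L_{k+1}\) (pure L-BFGS), matching the description above.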
In the FUSE update, \(x\) denotes the parameters, \(\alpha\) the Adam learning rate, \(g(x_k)\) the gradient at iterate \(x_k\), \(m_k\) and \(v_k\) the first- and second-moment estimates with decay rates \(\beta_1, \beta_2\), \(a\) a small stability constant, \(\odot\) and \(\oslash\) elementwise product and division, \(p_k\) the L-BFGS direction from the two-loop recursion, \(\alpha_k\) its Wolfe line-search step size, and \(\theta \in [0,1]\) the weight blending the Adam step \(x^A_{k+1}\) with the L-BFGS step \(x^L_{k+1}\).
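To make the blending concrete, here is a minimal pure-Python sketch of one FUSE iteration. It is illustrative only and is not the library's implementation. The function names (`two_loop_direction`, `adam_step`, `fuse_step`, `pv_theta`) are hypothetical. The Adam step omits bias correction. The Wolfe line search is replaced by a fixed step size `lbfgs_step`. The FUSE-PV helper assumes the switch goes from Adam to L-BFGS once the gradient norm drops below a tolerance.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def two_loop_direction(grad, history):
    """L-BFGS direction p = -H * grad via the standard two-loop recursion.

    `history` is a list of (s_i, y_i) pairs, oldest first.
    """
    q = list(grad)
    coeffs = []  # (alpha_i, rho_i), stored newest first
    for s, y in reversed(history):
        rho = 1.0 / dot(y, s)
        alpha_i = rho * dot(s, q)
        coeffs.append((alpha_i, rho))
        q = [qi - alpha_i * yi for qi, yi in zip(q, y)]
    # Scale by the usual initial Hessian guess gamma * I (identity if no history).
    if history:
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    for (s, y), (alpha_i, rho) in zip(history, reversed(coeffs)):
        beta = rho * dot(y, r)
        r = [ri + (alpha_i - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


def adam_step(x, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step without bias correction (assumption); eps plays the role of a."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    x_adam = [xi - lr * mi / (math.sqrt(vi) + eps) for xi, mi, vi in zip(x, m, v)]
    return x_adam, m, v


def fuse_step(x, g, m, v, history, theta, lr=1e-3, lbfgs_step=1.0):
    """Blend the Adam and L-BFGS candidates with convex weight theta.

    lbfgs_step is a fixed stand-in for the Wolfe line-search step (assumption).
    """
    x_adam, m, v = adam_step(x, g, m, v, lr=lr)
    p = two_loop_direction(g, history)
    x_lbfgs = [xi + lbfgs_step * pi for xi, pi in zip(x, p)]
    x_new = [theta * a + (1 - theta) * b for a, b in zip(x_adam, x_lbfgs)]
    return x_new, m, v


def pv_theta(grad, tol=1e-2):
    """FUSE-PV style hard switch (assumed direction): Adam until ||g|| < tol, then L-BFGS."""
    return 1.0 if math.sqrt(dot(grad, grad)) >= tol else 0.0


# Usage on f(x) = 0.5 * (x0^2 + 10 * x1^2), whose gradient is [x0, 10 * x1].
x = [1.0, 1.0]
g = [x[0], 10.0 * x[1]]
x_next, m, v = fuse_step(x, g, [0.0, 0.0], [0.0, 0.0], history=[], theta=0.5)
```

Keeping the two candidate iterates separate and blending them only at the end means the limiting cases fall out directly: `theta=1.0` returns the Adam iterate and `theta=0.0` returns the L-BFGS iterate. A production version would also maintain the `(s, y)` history across iterations and skip pairs whose curvature `dot(y, s)` is not positive.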
Reference: Zhanhong Jiang, Md Zahid Hasan, Aditya Balu, Joshua R. Waite, Genyi Huang, Soumik Sarkar, "FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization", arXiv 2025. https://arxiv.org/abs/2503.04204