
Distributed and Communication-Efficient Optimizers

Optimizers in this category target training across many devices or nodes, where memory and inter-worker communication are the main bottlenecks. They shard optimizer state, compress gradient exchange, or synchronize infrequently so that training scales without a proportional increase in bandwidth. Some entries are standalone update rules, while others wrap an inner optimizer with a communication-efficient outer loop.
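
As a rough illustration of what "compress gradient exchange" can mean, the sketch below has each worker send only the sign of every gradient coordinate, and the server combine them by a per-coordinate majority vote. It is a toy in the spirit of sign-based methods, not the algorithm of any specific entry in the table. The function names (`compress_to_signs`, `majority_vote`, `sign_step`), the quadratic objective, and the noise model are all assumptions made for the example.

```python
import random

def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return 1 if x > 0 else (-1 if x < 0 else 0)

def compress_to_signs(grad):
    """Worker side: keep only the sign of each gradient coordinate."""
    return [sign(g) for g in grad]

def majority_vote(sign_vectors):
    """Server side: per-coordinate sign of the summed worker signs."""
    return [sign(sum(column)) for column in zip(*sign_vectors)]

def sign_step(params, worker_grads, lr):
    """Apply one update using only the aggregated sign information."""
    votes = majority_vote([compress_to_signs(g) for g in worker_grads])
    return [p - lr * v for p, v in zip(params, votes)]

# Toy problem (assumed): minimize 0.5 * ||x||^2, whose gradient is x itself.
# Each of 5 simulated workers sees the gradient plus Gaussian noise.
rng = random.Random(0)
params = [3.0, -2.0, 0.5]
for _ in range(400):
    worker_grads = [
        [p + rng.gauss(0.0, 0.1) for p in params] for _ in range(5)
    ]
    params = sign_step(params, worker_grads, lr=0.01)
```

Each worker's message here carries one sign per coordinate instead of a full-precision float, which is the source of the bandwidth saving. The price is that the update magnitude no longer depends on the gradient norm, so the step size has to be chosen with that in mind.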

| Optimizer | Venue | Paper | Code |
| --- | --- | --- | --- |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | official |
| LD-SGD | arXiv 2019 | Communication-Efficient Local Decentralized SGD Methods | |
| Local SGD | ICLR 2019 | Local SGD Converges Fast and Communicates Little | community |
| PowerSGD | NeurIPS 2019 | PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | |
| Qsparse-local-SGD | NeurIPS 2019 | Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations | |
| signProx | ICASSP 2019 | signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization | |
| APMSqueeze | arXiv 2020 | APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm | |
| DEED-GD | arXiv 2020 | DEED: A General Quantization Scheme for Communication Efficiency in Bits | |
| FedAC | NeurIPS 2020 | Federated Accelerated Stochastic Gradient Descent | |
| LAGS-SGD | ECAI 2020 | Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees | |
| rTop-k | JSAIT 2020 | rTop-k: A Statistical Estimation Approach to Distributed SGD | |
| SCAFFOLD | ICML 2020 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | |
| SlowMo | ICLR 2020 | SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum | |
| ZeRO | SC 2020 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | official |
| 1-bit Adam | ICML 2021 | 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed | official |
| BVR-L-SGD | ICML 2021 | Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning | |
| SQuARM-SGD | JSAIT 2021 | SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization | |
| SketchedAMSGrad | ICDM 2022 | Communication-Efficient Adam-Type Algorithms for Distributed Data Mining | |
| 0/1 Adam | ICLR 2023 | Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam | official |
| AdaCGD | TMLR 2023 | Adaptive Compression for Communication-Efficient Distributed Training | |
| DiLoCo | arXiv 2023 | DiLoCo: Distributed Low-Communication Training of Language Models | community |
| Distributed Shampoo | arXiv 2023 | A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale | official |
| SPARQ-SGD | TAC 2023 | SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization | |
| AdaFedAdam | TMLCN 2024 | Accelerating Fair Federated Learning: Adaptive Federated Adam | official |
| DeMo | arXiv 2024 | DeMo: Decoupled Momentum Optimization | official |
| FADAS | ICML 2024 | FADAS: Towards Federated Adaptive Asynchronous Optimization | official |
| FAGH | arXiv 2024 | FAGH: Accelerating Federated Learning with Approximated Global Hessian | |
| Fed-Sophia | ICC 2024 | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | |
| FedLion | ICASSP 2024 | FedLion: Faster Adaptive Federated Optimization with Fewer Communication | official |
| FedRepOpt | ACCV 2024 | FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning | official |
| FedSTaS | arXiv 2024 | FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning | official |
| FESS-GDA | AISTATS 2024 | Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization | |
| FLeNS | BigData 2024 | FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch | official |
| MM-PSGD / MC-PSGD | MMAsia-W 2024 | Distributed Optimization over Block-Cyclic Data | |
| OpenDiLoCo | arXiv 2024 | OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training | official |
| ADEF | arXiv 2025 | Accelerated Distributed Optimization with Compression and Error Feedback | |
| DAT-SGD | ICML 2025 | Enhancing Parallelism in Decentralized Stochastic Convex Optimization | |
| DeCo-SGD | arXiv 2025 | Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training | |
| DES-LOC | arXiv 2025 | DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models | |
| Dion | arXiv 2025 | Dion: Distributed Orthonormalized Updates | official |
| DLAS-R-FTC | CDC 2025 | Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination | |
| FAdamGC | arXiv 2025 | Gradient Correction in Federated Learning with Adaptive Optimization | |
| FedCET | arXiv 2025 | Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data | |
| FedIvon | TMLR 2025 | Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization | |
| FedMuon | arXiv 2025 | FedMuon: Accelerating Federated Learning with Matrix Orthogonalization | official |
| FedOne | ICML 2025 | FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning | |
| HybridSGD | arXiv 2025 | Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization | |
| Kuramoto-FedAvg | arXiv 2025 | Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity | official |
| LQ-SGD | arXiv 2025 | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | |
| Muon | arXiv 2025 | Muon is Scalable for LLM Training | official Muon |
| pFedSOP | arXiv 2025 | pFedSOP: Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization | |
| LT-ADMM | TAC 2026 | Communication-Efficient Stochastic Distributed Learning | |
| Ringleader ASGD | ICLR 2026 | Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity | |
| DECA | arXiv 2026 | DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data | |
| Ringmaster LMO | arXiv 2026 | Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method | |
| SignMuon | arXiv 2026 | SignMuon: Communication-Efficient Distributed Muon Optimization | |
| Orth-Dion | arXiv 2026 | Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization | |
| EF21-Muon | arXiv 2025 | Error Feedback for Muon and Friends | |
| MuonBP | ICLR 2026 | MuonBP: Faster Muon via Block-Periodic Orthogonalization | |
| CurvaDion | arXiv 2025 | CurvaDion: Curvature-Adaptive Distributed Orthonormalization | |
| Quasi-Newton FL with Error Feedback | OPT 2025: Optimization for Machine Learning (NeurIPS 2025 Workshop) | Quasi-Newton Methods for Federated Learning with Error Feedback | |
| DeMuon | arXiv 2025 | DeMuon: A Decentralized Muon for Matrix Optimization over Graphs | |
| HeLoCo | arXiv 2026 | HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity | |
| Decoupled DiLoCo | arXiv 2026 | Decoupled DiLoCo for Resilient Distributed Pre-training | |
| Partial Parameter Updates | arXiv 2025 | Partial Parameter Updates for Efficient Distributed Training | |
| SparseLoCo | arXiv 2025 | Communication Efficient LLM Pre-training with SparseLoCo | official |
| GASLoC | arXiv 2026 | Unifying Local Communications and Local Updates for LLM Pretraining | |
| MG-ADSGD | arXiv 2026 | Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization | |
| Local MixVR | arXiv 2026 | Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning | |
| LOSCAR-SGD | arXiv 2026 | LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging | |
| HEW-Local SGD | arXiv (math.OC) 2026 | Heterogeneous-Horizon Exact-Weight Local SGD | |
| CAPTAIN (C-ALADIN) | arXiv 2026 | A Global Convergence Analysis of Consensus ALADIN for Convex Optimization | |
| FedPAC | arXiv 2026 | Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data | official |
| FedAdamW | AAAI 2026 | FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models | official |
| LoRDO | arXiv 2026 | LoRDO: Distributed Low-Rank Optimization with Infrequent Communication | |
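
Several entries above wrap an inner optimizer in an outer loop that communicates only occasionally. The sketch below shows that general pattern under simplifying assumptions: plain SGD as the inner optimizer, deterministic gradients, and one quadratic objective per worker. It is not the update rule of any particular paper in the table, and the names `inner_sgd_step`, `local_round`, `outer_step`, and `make_grad_fn` are hypothetical.

```python
def inner_sgd_step(params, grad, lr):
    """Hypothetical inner optimizer: one plain SGD step."""
    return [p - lr * g for p, g in zip(params, grad)]

def local_round(global_params, grad_fn, local_steps, inner_lr):
    """Run several inner steps on one worker with no communication."""
    params = list(global_params)
    for _ in range(local_steps):
        params = inner_sgd_step(params, grad_fn(params), inner_lr)
    return params

def outer_step(global_params, worker_params, outer_lr):
    """Average the workers' parameter deltas and apply them globally.

    With outer_lr = 1.0 this reduces to plain parameter averaging.
    """
    n = len(worker_params)
    avg_delta = [
        sum(w[i] - global_params[i] for w in worker_params) / n
        for i in range(len(global_params))
    ]
    return [p + outer_lr * d for p, d in zip(global_params, avg_delta)]

def make_grad_fn(center):
    """Gradient of the assumed worker objective 0.5 * ||x - center||^2."""
    return lambda params: [p - c for p, c in zip(params, center)]

# Three simulated workers with different local optima (assumed toy data).
centers = [[1.0, 0.0], [3.0, 2.0], [-1.0, 4.0]]
grad_fns = [make_grad_fn(c) for c in centers]

global_params = [0.0, 0.0]
for _ in range(20):  # 20 communication rounds
    worker_params = [
        local_round(global_params, fn, local_steps=8, inner_lr=0.1)
        for fn in grad_fns
    ]
    global_params = outer_step(global_params, worker_params, outer_lr=1.0)
```

In this toy run each worker takes 160 gradient steps but synchronizes only 20 times, and the global parameters approach the mean of the three centers. Real methods differ in what the outer step does with the averaged delta (for example, applying momentum to it) and in how they handle heterogeneous data, where simple averaging can drift.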