Orthonormal updates for faster distributed ML training
Top 92.0% on SourcePulse
Dion and Muon are PyTorch optimizers designed to accelerate neural network training by employing orthonormal weight updates, offering faster convergence than traditional methods like Adam/AdamW. They are particularly beneficial for large-scale distributed training and target researchers and engineers working with modern PyTorch and DTensor-based parallelism.
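As a rough illustration of how such an optimizer slots into a training loop, here is a minimal sketch; the import path and constructor arguments are assumptions made for the example, not taken from the project's documentation:

import torch
from dion import Dion  # assumed import path and class name

# Hypothetical sketch: assumes Dion implements the standard torch.optim.Optimizer
# interface; the model and hyperparameters are purely illustrative.
model = torch.nn.Linear(1024, 1024)
optimizer = Dion(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In practice, orthonormal-update optimizers such as Muon are usually applied only to 2-D weight matrices, while embeddings, normalization parameters, and output layers are handled by an element-wise optimizer like AdamW.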
How It Works
Dion utilizes amortized power iteration for orthonormalization, enabling direct application on sharded matrices and supporting low-rank compression via a rank fraction hyperparameter. This approach reduces communication overhead compared to Muon's Newton-Schulz iterations, which require reconstructing full matrices from shards. Dion also incorporates an error feedback mechanism to mitigate information loss from compression.
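To make the mechanism concrete, here is a minimal single-device sketch of one such step: a single round of power iteration producing a low-rank orthonormalized update, plus error feedback for the compression residual. This is an illustration under simplifying assumptions (dense tensors, no DTensor sharding, no communication), not the library's implementation, and the function and state names are made up for the example.

import torch

@torch.no_grad()
def dion_like_step(weight, grad, state, lr=0.01, mu=0.95, rank_fraction=0.25):
    # One illustrative optimizer step for a 2-D weight matrix.
    m, n = weight.shape
    r = max(1, int(rank_fraction * min(m, n)))  # low-rank dimension set by the rank fraction

    # Persistent state: a momentum-like buffer and a warm-started right factor Q.
    if "buffer" not in state:
        state["buffer"] = torch.zeros_like(weight)
        state["Q"] = torch.randn(n, r, device=weight.device) / n ** 0.5

    B = state["buffer"].add_(grad)   # accumulate the new gradient into the buffer
    P = B @ state["Q"]               # amortized power iteration: left factor, shape (m, r)
    P, _ = torch.linalg.qr(P)        # orthonormalize the columns of P
    R = B.t() @ P                    # right factor, shape (n, r)

    # Error feedback: remove (1 - mu) of the component captured by this step, so the
    # uncaptured residual (plus a mu fraction of the captured part) carries forward.
    state["buffer"] = B - (1.0 - mu) * (P @ R.t())

    # Column-normalize R so that P @ Q_new.T is an (approximately) orthonormal update,
    # and warm-start the next power iteration from Q_new.
    Q_new = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)
    state["Q"] = Q_new

    weight.add_(P @ Q_new.t(), alpha=-lr)  # apply the orthonormalized low-rank update

# Example: one step on a random matrix standing in for a weight and its gradient.
W = torch.randn(512, 256)
state = {}
dion_like_step(W, torch.randn_like(W), state)

Because the buffer is only ever multiplied by skinny r-column factors in this scheme, a sharded implementation never needs to gather a full weight-sized matrix on one device, which is where the communication savings over Muon's Newton-Schulz orthogonalization come from.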
Quick Start & Requirements
Install the optimizer package:
pip install git+https://github.com/microsoft/dion.git
To reproduce the example training runs, clone the repository and install the training extras:
pip install -e .[train]
Then download the FineWeb dataset and run the training scripts, e.g.:
torchrun --standalone --nproc_per_node=8 train.py --config configs/dion_160m.yaml
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
replicate_mesh_grad_sync=True leads to decoupled momentum states across data-parallel processes.