dion by microsoft

Orthonormal updates for faster distributed ML training

Created 4 months ago
366 stars

Top 77.0% on SourcePulse

1 Expert Loves This Project
Project Summary

Dion and Muon are PyTorch optimizers designed to accelerate neural network training by employing orthonormal weight updates, offering faster convergence than traditional methods like Adam/AdamW. They are particularly beneficial for large-scale distributed training scenarios, targeting researchers and engineers working with modern PyTorch and DTensor-based parallelism.

How It Works

Dion utilizes amortized power iteration for orthonormalization, enabling direct application on sharded matrices and supporting low-rank compression via a rank fraction hyperparameter. This approach reduces communication overhead compared to Muon's Newton-Schulz iterations, which require reconstructing full matrices from shards. Dion also incorporates an error feedback mechanism to mitigate information loss from compression.
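
To make this concrete, here is a single-step sketch of the amortized power-iteration idea in plain PyTorch. It is a simplification for intuition only, not the exact update rule used by this repository: the function name, scaling, momentum handling, and orthonormalization details are illustrative placeholders.

```python
import torch

def orthonormal_lowrank_step(param, grad, momentum, Q, lr=0.01, mu=0.95):
    """One illustrative Dion-style step on an (m, n) weight matrix.

    Q is a cached (n, r) right factor reused across steps, so each step runs
    only a single (amortized) power-iteration pass instead of iterating to
    convergence. Names and scaling here are illustrative, not the real API.
    """
    B = momentum + grad                       # fold the new gradient into momentum, (m, n)
    P = B @ Q                                 # one power-iteration pass, (m, r)
    P, _ = torch.linalg.qr(P)                 # orthonormalize the left factor
    R = B.T @ P                               # refresh the right factor, (n, r)

    # Error feedback: keep the residual that the rank-r factorization missed,
    # so information lost to compression re-enters the momentum on later steps.
    momentum = B - (1.0 - mu) * (P @ R.T)

    Q = R / (R.norm(dim=0, keepdim=True) + 1e-8)     # column-normalize the right factor
    m, n = param.shape
    param = param - lr * (m / n) ** 0.5 * (P @ Q.T)  # scaled orthonormal update
    return param, momentum, Q

# Toy usage: rank fraction r / n = 0.25.
m, n, r = 256, 128, 32
W, M, Q = torch.randn(m, n), torch.zeros(m, n), torch.randn(n, r)
W, M, Q = orthonormal_lowrank_step(W, torch.randn(m, n), M, Q)
```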

Quick Start & Requirements

  • Install: pip install git+https://github.com/microsoft/dion.git
  • Prerequisites: PyTorch 2.7+ with DTensor-based parallelism (FSDP2, TP); a minimal FSDP2 setup sketch follows this list.
  • Setup: Clone the repo, install dependencies (pip install -e .[train]), download the FineWeb dataset, and run training scripts (e.g., torchrun --standalone --nproc_per_node=8 train.py --config configs/dion_160m.yaml).
  • Documentation: README
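
The prerequisites above refer to PyTorch's DTensor-based parallelism. As a rough illustration of the environment these optimizers target, the sketch below shards a placeholder model with FSDP2 using standard PyTorch 2.7 APIs; the model and mesh shape are placeholders, and the optimizer itself is constructed separately (see the grouping sketch after the next list).

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

# Placeholder model; any module with 2D weight matrices works.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Run under torchrun, which sets LOCAL_RANK / WORLD_SIZE.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = model.cuda()

# 1D device mesh over all data-parallel ranks; this also initializes the
# default process group if it is not already initialized.
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

# FSDP2: after this, the model's parameters are DTensors sharded over the mesh.
for layer in model:
    if isinstance(layer, nn.Linear):
        fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)
```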

Highlighted Details

  • Supports PyTorch DDP, FSDP2, and FSDP2 + TP (for Dion).
  • Requires manual parameter grouping for matrix weights, biases, embeddings, and unembeddings (see the grouping sketch after this list).
  • Offers compressed gradient synchronization for data-parallel training.
  • Includes experimental features like mixed-precision optimizer states and Triton kernels for Muon.
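
The parameter grouping noted above can be illustrated with a short sketch. The `Dion` class name, the per-group `algorithm` key, and the `rank_fraction` argument below are assumptions based on the README's description (orthonormal updates for matrix weights, element-wise updates for biases and embedding/unembedding weights, and a rank-fraction knob); the library's actual API may differ.

```python
import torch.nn as nn

# Hypothetical import; module path and constructor signature are assumptions.
from dion import Dion

class TinyLM(nn.Module):
    """Minimal placeholder model containing the parameter types the grouping cares about."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)             # embedding
        self.hidden = nn.Linear(dim, dim)                 # 2D matrix weight (+ bias)
        self.norm = nn.LayerNorm(dim)                     # 1D gain / bias
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # unembedding

    def forward(self, x):
        return self.lm_head(self.norm(self.hidden(self.embed(x))))

def build_param_groups(model: nn.Module):
    """Split parameters: matrix weights get orthonormal updates; biases/gains and
    embedding/unembedding weights go into separate groups for element-wise updates.
    The group keys are illustrative, not the library's exact schema."""
    matrix, scalar, embed = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name or "lm_head" in name:
            embed.append(p)
        elif p.ndim >= 2:
            matrix.append(p)
        else:
            scalar.append(p)
    return [
        {"params": matrix},                        # orthonormal (Dion) updates
        {"params": scalar, "algorithm": "adamw"},  # assumed per-group override
        {"params": embed, "algorithm": "adamw"},   # assumed per-group override
    ]

# Hypothetical construction; rank_fraction mirrors the low-rank knob described
# in "How It Works", but the argument name is an assumption.
optimizer = Dion(build_param_groups(TinyLM()), lr=0.01, rank_fraction=0.25)
```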

Maintenance & Community

Licensing & Compatibility

  • License: Not stated in the README. Microsoft open-source projects typically use MIT or a similarly permissive license, but verify the repository's LICENSE file before relying on it for commercial use.

Limitations & Caveats

  • Dion does not support convolution layers directly; Muon has experimental support with flattening.
  • Parameter grouping is manual and critical for correct optimizer behavior, especially for embedding/unembedding layers.
  • Compressed gradient synchronization with replicate_mesh_grad_sync=True leads to decoupled momentum states across data-parallel processes.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

26 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago
6k stars
Top 0.5% on SourcePulse
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 27 more.

ColossalAI by hpcaitech

AI system for large-scale parallel training
Created 4 years ago
Updated 1 day ago
41k stars
Top 0.0% on SourcePulse