dion by microsoft

Orthonormal updates for faster distributed ML training

created 2 months ago
284 stars

Top 92.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Dion and Muon are PyTorch optimizers designed to accelerate neural network training by employing orthonormal weight updates, offering faster convergence than traditional methods like Adam/AdamW. They are particularly beneficial for large-scale distributed training scenarios, targeting researchers and engineers working with modern PyTorch and DTensor-based parallelism.

How It Works

Dion utilizes amortized power iteration for orthonormalization, enabling direct application on sharded matrices and supporting low-rank compression via a rank fraction hyperparameter. This approach reduces communication overhead compared to Muon's Newton-Schulz iterations, which require reconstructing full matrices from shards. Dion also incorporates an error feedback mechanism to mitigate information loss from compression.
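
The idea can be illustrated in a few lines of plain PyTorch. The sketch below is a dense, single-device approximation of one amortized power-iteration step with error feedback; the function name, shapes, and scaling are illustrative assumptions, and the library's actual implementation operates on sharded DTensors and differs in detail.

    import torch

    def dion_like_step(momentum, Q, lr=0.01):
        # Illustrative only: one dense orthonormalized update via amortized
        # power iteration; not the library's sharded implementation.
        P = momentum @ Q                    # project onto warm-started right factor, (m, r)
        P, _ = torch.linalg.qr(P)           # column-orthonormal left factor
        R = momentum.T @ P                  # refreshed right factor, (n, r)
        momentum = momentum - P @ R.T       # error feedback: keep what the low-rank step missed
        Q = R / (R.norm(dim=0, keepdim=True) + 1e-8)  # normalized right factor for the next step
        update = P @ Q.T                    # approximately orthonormal low-rank direction
        return -lr * update, momentum, Q

    # Toy usage: rank fraction of roughly 64/512 on a 1024x512 weight matrix.
    W = torch.randn(1024, 512)
    momentum = torch.randn_like(W)
    Q = torch.randn(512, 64)
    delta, momentum, Q = dion_like_step(momentum, Q)
    W = W + delta

Because Q is carried between steps, each step performs only one matrix multiply per factor instead of a full decomposition, which is what makes the power iteration "amortized" and cheap to apply per shard.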

Quick Start & Requirements

  • Install: pip install git+https://github.com/microsoft/dion.git
  • Prerequisites: PyTorch 2.7+ with DTensor-based parallelism (FSDP2, TP).
  • Setup: Clone the repo, install dependencies (pip install -e .[train]), download the FineWeb dataset, and run training scripts (e.g., torchrun --standalone --nproc_per_node=8 train.py --config configs/dion_160m.yaml). A minimal usage sketch follows this list.
  • Documentation: README
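
As referenced in the setup step above, a minimal single-process sketch of dropping the optimizer onto one weight matrix could look as follows. The class name Dion is taken from the repo's install command, but the constructor arguments shown are placeholders; consult the README for the exact signature, recommended hyperparameters, and the DTensor/distributed setup.

    import torch
    from dion import Dion  # class name from the repo; arguments below are illustrative

    model = torch.nn.Linear(1024, 1024, bias=False)  # a single 2D matrix parameter

    # Hyperparameters are placeholders; see the README for recommended values
    # and for rank-fraction / device-mesh options.
    optimizer = Dion(model.parameters(), lr=0.01)

    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()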

Highlighted Details

  • Supports PyTorch DDP, FSDP2, and FSDP2 + TP (for Dion).
  • Requires manual parameter grouping for matrix weights, biases, embeddings, and unembeddings (see the sketch after this list).
  • Offers compressed gradient synchronization for data-parallel training.
  • Includes experimental features like mixed-precision optimizer states and Triton kernels for Muon.
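
The parameter-grouping requirement mentioned above can be sketched roughly as below. The routing logic shows only the general idea, and the "algorithm" group key and its values are assumptions about the expected per-group format; the repo README documents the exact grouping scheme the optimizer requires.

    import torch
    from dion import Dion  # class name from the repo; group keys below are assumed

    class TinyLM(torch.nn.Module):
        def __init__(self, vocab=1000, dim=256):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab, dim)              # embedding
            self.proj = torch.nn.Linear(dim, dim)                    # matrix weight + bias
            self.lm_head = torch.nn.Linear(dim, vocab, bias=False)   # unembedding

    model = TinyLM()

    # Route embeddings/unembeddings and 1D params (biases, norms) away from the
    # matrix algorithm; only 2D hidden weights get orthonormal updates.
    matrix_params, embed_params, other_params = [], [], []
    for name, p in model.named_parameters():
        if name.startswith(("embed", "lm_head")):
            embed_params.append(p)
        elif p.ndim >= 2:
            matrix_params.append(p)
        else:
            other_params.append(p)

    # The "algorithm" key/value pairs are assumptions, not the documented API;
    # consult the README for the actual per-group options.
    optimizer = Dion(
        [
            {"params": matrix_params},
            {"params": embed_params, "algorithm": "lion"},
            {"params": other_params, "algorithm": "lion"},
        ],
        lr=0.01,
    )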

Maintenance & Community

Licensing & Compatibility

  • License: Not explicitly stated in the README. Microsoft open-source projects typically use MIT or a similarly permissive license, so commercial use is likely permitted, but verify before relying on it.

Limitations & Caveats

  • Dion does not directly support convolutional layers; Muon has experimental support via flattening.
  • Parameter grouping is manual and critical for correct optimizer behavior, especially for embedding/unembedding layers.
  • Compressed gradient synchronization with replicate_mesh_grad_sync=True leads to decoupled momentum states across data-parallel processes.
Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 1

Star History

288 stars in the last 30 days

Explore Similar Projects

Starred by Amanpreet Singh (Cofounder of Contextual AI) and Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code).

torchshard by kaiyuyue

0%
300
PyTorch engine for tensor slicing into parallel shards
created 4 years ago
updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.6%
3k
High-performance 4-bit diffusion model inference engine
created 9 months ago
updated 1 day ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 10 more.

FasterTransformer by NVIDIA

0.1%
6k
Optimized transformer library for inference
created 4 years ago
updated 1 year ago