dion by microsoft

Orthonormal updates for faster distributed ML training

Created 4 months ago
366 stars

Top 77.0% on SourcePulse

1 Expert Loves This Project
Project Summary

Dion and Muon are PyTorch optimizers designed to accelerate neural network training by employing orthonormal weight updates, offering faster convergence than traditional methods like Adam/AdamW. They are particularly beneficial for large-scale distributed training scenarios, targeting researchers and engineers working with modern PyTorch and DTensor-based parallelism.

How It Works

Dion utilizes amortized power iteration for orthonormalization, enabling direct application on sharded matrices and supporting low-rank compression via a rank fraction hyperparameter. This approach reduces communication overhead compared to Muon's Newton-Schulz iterations, which require reconstructing full matrices from shards. Dion also incorporates an error feedback mechanism to mitigate information loss from compression.
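
To make this concrete, here is a single-step sketch of the amortized power-iteration idea in plain PyTorch. It is a simplification for intuition only, not the exact update rule used by this repository: the function name, scaling, momentum handling, and orthonormalization details are illustrative placeholders.

```python
import torch

def orthonormal_lowrank_step(param, grad, momentum, Q, lr=0.01, mu=0.95):
    """One illustrative Dion-style step on an (m, n) weight matrix.

    Q is a cached (n, r) right factor reused across steps, so each step runs
    only a single (amortized) power-iteration pass instead of iterating to
    convergence. Names and scaling here are illustrative, not the real API.
    """
    B = momentum + grad                       # fold the new gradient into momentum, (m, n)
    P = B @ Q                                 # one power-iteration pass, (m, r)
    P, _ = torch.linalg.qr(P)                 # orthonormalize the left factor
    R = B.T @ P                               # refresh the right factor, (n, r)

    # Error feedback: keep the residual that the rank-r factorization missed,
    # so information lost to compression re-enters the momentum on later steps.
    momentum = B - (1.0 - mu) * (P @ R.T)

    Q = R / (R.norm(dim=0, keepdim=True) + 1e-8)     # column-normalize the right factor
    m, n = param.shape
    param = param - lr * (m / n) ** 0.5 * (P @ Q.T)  # scaled orthonormal update
    return param, momentum, Q

# Toy usage: rank fraction r / n = 0.25.
m, n, r = 256, 128, 32
W, M, Q = torch.randn(m, n), torch.zeros(m, n), torch.randn(n, r)
W, M, Q = orthonormal_lowrank_step(W, torch.randn(m, n), M, Q)
```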

Quick Start & Requirements

  • Install: pip install git+https://github.com/microsoft/dion.git
  • Prerequisites: PyTorch 2.7+ with DTensor-based parallelism (FSDP2, TP); a minimal FSDP2 setup sketch follows this list.
  • Setup: Clone the repo, install dependencies (pip install -e .[train]), download the FineWeb dataset, and run training scripts (e.g., torchrun --standalone --nproc_per_node=8 train.py --config configs/dion_160m.yaml).
  • Documentation: README
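
The prerequisites above refer to PyTorch's DTensor-based parallelism. As a rough illustration of the environment these optimizers target, the sketch below shards a placeholder model with FSDP2 using standard PyTorch 2.7 APIs; the model and mesh shape are placeholders, and the optimizer itself is constructed separately (see the grouping sketch after the next list).

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

# Placeholder model; any module with 2D weight matrices works.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Run under torchrun, which sets LOCAL_RANK / WORLD_SIZE.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = model.cuda()

# 1D device mesh over all data-parallel ranks; this also initializes the
# default process group if it is not already initialized.
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

# FSDP2: after this, the model's parameters are DTensors sharded over the mesh.
for layer in model:
    if isinstance(layer, nn.Linear):
        fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)
```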

Highlighted Details

  • Supports PyTorch DDP, FSDP2, and FSDP2 + TP (for Dion).
  • Requires manual parameter grouping for matrix weights, biases, embeddings, and unembeddings (see the grouping sketch after this list).
  • Offers compressed gradient synchronization for data-parallel training.
  • Includes experimental features like mixed-precision optimizer states and Triton kernels for Muon.
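
The parameter grouping noted above can be illustrated with a short sketch. The `Dion` class name, the per-group `algorithm` key, and the `rank_fraction` argument below are assumptions based on the README's description (orthonormal updates for matrix weights, element-wise updates for biases and embedding/unembedding weights, and a rank-fraction knob); the library's actual API may differ.

```python
import torch.nn as nn

# Hypothetical import; module path and constructor signature are assumptions.
from dion import Dion

class TinyLM(nn.Module):
    """Minimal placeholder model containing the parameter types the grouping cares about."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)             # embedding
        self.hidden = nn.Linear(dim, dim)                 # 2D matrix weight (+ bias)
        self.norm = nn.LayerNorm(dim)                     # 1D gain / bias
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # unembedding

    def forward(self, x):
        return self.lm_head(self.norm(self.hidden(self.embed(x))))

def build_param_groups(model: nn.Module):
    """Split parameters: matrix weights get orthonormal updates; biases/gains and
    embedding/unembedding weights go into separate groups for element-wise updates.
    The group keys are illustrative, not the library's exact schema."""
    matrix, scalar, embed = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name or "lm_head" in name:
            embed.append(p)
        elif p.ndim >= 2:
            matrix.append(p)
        else:
            scalar.append(p)
    return [
        {"params": matrix},                        # orthonormal (Dion) updates
        {"params": scalar, "algorithm": "adamw"},  # assumed per-group override
        {"params": embed, "algorithm": "adamw"},   # assumed per-group override
    ]

# Hypothetical construction; rank_fraction mirrors the low-rank knob described
# in "How It Works", but the argument name is an assumption.
optimizer = Dion(build_param_groups(TinyLM()), lr=0.01, rank_fraction=0.25)
```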

Maintenance & Community

Licensing & Compatibility

  • License: Not stated in the README. Microsoft open-source projects typically use MIT or a similarly permissive license, but verify the repository's LICENSE file before relying on it for commercial use.

Limitations & Caveats

  • Dion does not support convolution layers directly; Muon has experimental support with flattening.
  • Parameter grouping is manual and critical for correct optimizer behavior, especially for embedding/unembedding layers.
  • Compressed gradient synchronization with replicate_mesh_grad_sync=True leads to decoupled momentum states across data-parallel processes.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

26 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago
6k stars
Top 0.5% on SourcePulse
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 27 more.

ColossalAI by hpcaitech

AI system for large-scale parallel training
Created 4 years ago
Updated 1 day ago
41k stars
Top 0.0% on SourcePulse