Tutel by microsoft

Optimized MoE library for modern training and inference

created 4 years ago
870 stars

Top 42.2% on sourcepulse

Project Summary

Tutel is an optimized Mixture-of-Experts (MoE) library for PyTorch, designed to enhance training and inference performance for large language models. It offers advanced parallelism and sparsity features, targeting researchers and engineers working with large-scale AI models, particularly those leveraging advanced hardware like NVIDIA A100/H100 and AMD MI300.

How It Works

Tutel implements a novel "No-penalty Parallelism/Sparsity/Capacity Switching" approach, allowing dynamic adjustments to MoE configurations without performance degradation. It optimizes communication primitives like all-to-all operations and supports advanced quantization techniques, including DeepSeek FP8 and FP4, to maximize hardware utilization and throughput.
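
The paragraph above describes what the switching adjusts but not how. As a rough illustration only (plain PyTorch with illustrative names, not Tutel's kernels or API), the sketch below shows the two knobs such a switch tunes at runtime: the gating top-k (sparsity) and the per-expert capacity.

    import torch
    import torch.nn.functional as F

    def topk_dispatch(x, gate_logits, k=2, capacity_factor=1.25):
        """Reference-only top-k MoE dispatch (illustrative names, not Tutel's API).

        x:           [tokens, model_dim] token representations
        gate_logits: [tokens, num_experts] router scores
        """
        tokens, num_experts = gate_logits.shape
        # Sparsity knob: how many experts each token is routed to.
        scores = F.softmax(gate_logits, dim=-1)
        topk_scores, topk_idx = scores.topk(k, dim=-1)        # [tokens, k]

        # Capacity knob: max tokens each expert may receive; raising it trades
        # fewer dropped tokens for more padding and compute.
        capacity = int(capacity_factor * tokens * k / num_experts)

        dispatched = x.new_zeros(num_experts, capacity, x.shape[-1])
        slots = torch.zeros(num_experts, dtype=torch.long)
        for t in range(tokens):                               # plain loop for clarity, not speed
            for j in range(k):
                e = topk_idx[t, j].item()
                if slots[e] < capacity:
                    dispatched[e, slots[e]] = topk_scores[t, j] * x[t]
                    slots[e] += 1                             # tokens past capacity are dropped
        return dispatched, slots

    # Both k and capacity_factor can change between calls without rebuilding the
    # layer -- the property the "no-penalty switching" claim refers to (Tutel does
    # this with fused GPU kernels and optimized all-to-all, not a Python loop).
    x = torch.randn(8, 16)
    logits = torch.randn(8, 4)
    out_k1, _ = topk_dispatch(x, logits, k=1, capacity_factor=1.0)
    out_k2, _ = topk_dispatch(x, logits, k=2, capacity_factor=2.0)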

Quick Start & Requirements

  • Installation: pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main, or build from source; a short usage sketch follows this list.
  • Prerequisites: PyTorch (>= 1.10 recommended), CUDA (>= 11.7) or ROCm (>= 6.2.2).
  • Resources: Supports multi-GPU setups (NVIDIA/AMD) and CPU.
  • Docs: https://github.com/microsoft/tutel
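
As a complement to the installation steps, here is a hedged usage sketch of constructing a single MoE layer. The moe_layer call and its arguments follow the repository's example scripts, but exact names and required setup (e.g., distributed initialization) may differ across versions, so verify against the docs above.

    import torch
    from tutel import moe as tutel_moe

    model_dim, hidden_size, num_local_experts = 512, 1024, 2
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Top-2 gated MoE layer with two local FFN experts. Argument names follow the
    # repo's example scripts and may change between versions; multi-GPU runs also
    # need the distributed environment set up first (see the repo's examples).
    moe_layer = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},
        model_dim=model_dim,
        experts={
            'type': 'ffn',
            'count_per_node': num_local_experts,
            'hidden_size_per_expert': hidden_size,
            'activation_fn': lambda t: torch.nn.functional.relu(t),
        },
    ).to(device)

    x = torch.randn(4, 64, model_dim, device=device)   # [batch, seq_len, model_dim]
    y = moe_layer(x)                                    # output keeps the input shape
    print(y.shape)

The repository also ships runnable example scripts under tutel/examples that exercise this layer end to end on single- and multi-GPU setups.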

Highlighted Details

  • First to support DeepSeek FP4 inference on A100/H100/MI300 hardware.
  • Achieves significantly higher decode TPS than TRT-LLM and SGLang on MI300X for FP4 models.
  • Offers dynamic switching for parallelism, sparsity, and capacity with minimal overhead.
  • Integrates MegaBlocks for improved decoder inference on single GPUs.

Maintenance & Community

  • Actively developed by Microsoft.
  • Contributions are welcomed via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Hardware support is broad, but specific optimizations may be more mature on NVIDIA GPUs.
  • The README documents ROCm support but provides fewer examples for AMD hardware than for NVIDIA.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 6

Star History

  • 59 stars in the last 90 days

Explore Similar Projects

Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

Top 0.2% on sourcepulse · 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago