Tutel by microsoft

Optimized MoE library for modern training and inference

Created 4 years ago
934 stars

Top 39.1% on SourcePulse

View on GitHub
Project Summary

Tutel is an optimized Mixture-of-Experts (MoE) library for PyTorch, designed to enhance training and inference performance for large language models. It offers advanced parallelism and sparsity features, targeting researchers and engineers working with large-scale AI models, particularly those leveraging advanced hardware like NVIDIA A100/H100 and AMD MI300.

How It Works

Tutel implements a "No-penalty Parallelism/Sparsity/Capacity Switching" scheme that lets MoE configurations, such as the expert-parallelism layout, gating top-k, and expert capacity, be adjusted dynamically without performance degradation. It also optimizes communication primitives such as all-to-all dispatch and supports advanced quantization formats, including DeepSeek FP8 and FP4, to maximize hardware utilization and throughput.
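
To make the MoE layer concrete, here is a minimal single-device sketch based on the moe_layer interface shown in the repository's README; argument names and defaults are assumptions that should be verified against the installed version:

    # Minimal sketch of constructing and calling a Tutel MoE layer
    # (argument names follow the repository README; verify against the
    # installed version before relying on them).
    import torch
    import torch.nn.functional as F
    from tutel import moe as tutel_moe

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.randn(8, 1024, device=device)          # (tokens, model_dim)

    moe_layer = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},           # top-2 sparse gating
        model_dim=x.shape[-1],
        experts={
            'type': 'ffn',
            'count_per_node': 2,                     # local experts on this device
            'hidden_size_per_expert': 2048,
            'activation_fn': lambda t: F.relu(t),
        },
    ).to(device)

    y = moe_layer(x)                                 # tokens routed to the selected experts
    print(y.shape)                                   # same leading shape as the input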

Quick Start & Requirements

  • Installation: pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main or build from source; a brief post-install check is sketched after this list.
  • Prerequisites: PyTorch (>= 1.10 recommended), CUDA (>= 11.7) or ROCm (>= 6.2.2).
  • Resources: Supports multi-GPU setups (NVIDIA/AMD) and CPU.
  • Docs: https://github.com/microsoft/tutel
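
A quick post-install check (a hypothetical snippet, not taken from the README) that confirms the package imports and reports whether a GPU is visible:

    # Hypothetical smoke test: verify the tutel package imports and report
    # whether PyTorch can see a CUDA/ROCm device.
    import torch
    from tutel import moe as tutel_moe

    print("moe_layer available:", hasattr(tutel_moe, "moe_layer"))
    print("GPU visible to PyTorch:", torch.cuda.is_available())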

Highlighted Details

  • First to support DeepSeek FP4 inference on A100/H100/MI300 hardware.
  • Achieves significantly higher decode throughput (tokens per second) than TRT-LLM and SGLang on MI300X for FP4 models.
  • Offers dynamic switching of parallelism, sparsity, and capacity with minimal overhead (see the gate-configuration sketch after this list).
  • Integrates Megablocks for improved decoder inference on single GPUs.
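
As a rough illustration of where sparsity and capacity live in the configuration, a hedged sketch of gate settings; the 'capacity_factor' key and its zero-means-dynamic convention are assumptions drawn from the README and should be verified against the installed version:

    # Sketch only: the gate dictionary passed as gate_type to tutel_moe.moe_layer
    # is where sparsity (top-k) and capacity are expressed. 'capacity_factor' = 0
    # is assumed to request dynamic capacity; a positive value is assumed to pin
    # a static capacity. Verify against the installed version before use.
    sparse_gate = {'type': 'top', 'k': 1, 'capacity_factor': 0}     # top-1 routing, dynamic capacity
    static_gate = {'type': 'top', 'k': 2, 'capacity_factor': 1.25}  # top-2 routing, fixed capacity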

Maintenance & Community

  • Actively developed by Microsoft.
  • Contributions are welcomed via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • While it supports a range of hardware, some optimizations may be more mature on NVIDIA GPUs than on AMD.
  • The README mentions support for ROCm but provides limited examples for AMD hardware compared to NVIDIA.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.1% · 3k stars
Attention kernel for plug-and-play inference acceleration
Created 1 year ago · Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 18 hours ago