Tutel by microsoft

Optimized MoE library for modern training and inference

Created 4 years ago
914 stars

Top 39.9% on SourcePulse

View on GitHub
Project Summary

Tutel is an optimized Mixture-of-Experts (MoE) library for PyTorch, designed to enhance training and inference performance for large language models. It offers advanced parallelism and sparsity features, targeting researchers and engineers working with large-scale AI models, particularly those leveraging advanced hardware like NVIDIA A100/H100 and AMD MI300.

How It Works

Tutel implements a novel "No-penalty Parallelism/Sparsity/Capacity Switching" approach, allowing dynamic adjustments to MoE configurations without performance degradation. It optimizes communication primitives like all-to-all operations and supports advanced quantization techniques, including DeepSeek FP8 and FP4, to maximize hardware utilization and throughput.
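
The routing and capacity ideas above can be illustrated with a small plain-PyTorch sketch. This is a conceptual illustration only (the function and variable names are made up for the example) and does not call Tutel's optimized kernels:

    import torch
    import torch.nn.functional as F

    def topk_route(x, gate_weight, num_experts, k=2, capacity_factor=1.25):
        # Illustrative top-k MoE routing with a capacity limit (not Tutel code).
        # x:           [num_tokens, model_dim] token embeddings
        # gate_weight: [model_dim, num_experts] gating projection
        logits = x @ gate_weight                        # [num_tokens, num_experts]
        scores = F.softmax(logits, dim=-1)
        topk_scores, topk_idx = scores.topk(k, dim=-1)  # each token picks k experts
        # Capacity bounds how many tokens an expert may accept before overflow
        # tokens are dropped or rerouted; "no-penalty capacity switching" means
        # this factor can change between steps without rebuilding the layer.
        num_tokens = x.shape[0]
        capacity = int(capacity_factor * num_tokens * k / num_experts)
        return topk_idx, topk_scores, capacity

    # Example: 8 tokens, model_dim=16, 4 experts, top-2 routing
    x = torch.randn(8, 16)
    w = torch.randn(16, 4)
    idx, scores, cap = topk_route(x, w, num_experts=4)
    print(idx.shape, scores.shape, cap)  # torch.Size([8, 2]) torch.Size([8, 2]) 5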

Quick Start & Requirements

  • Installation: pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main or build from source (a minimal usage sketch follows this list).
  • Prerequisites: PyTorch (>= 1.10 recommended), CUDA (>= 11.7) or ROCm (>= 6.2.2).
  • Resources: Supports multi-GPU setups (NVIDIA/AMD) and CPU.
  • Docs: https://github.com/microsoft/tutel
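
Once installed, a minimal usage sketch looks roughly like the following. The moe_layer arguments shown here are an assumption based on the interface in the upstream README and may differ between versions, so check the repo's examples; multi-GPU runs also go through Tutel's distributed launch setup rather than a plain single-process script.

    import torch
    import torch.nn.functional as F
    from tutel import moe as tutel_moe

    # Assumed interface (from the upstream README): a top-2 gated MoE layer with
    # two local FFN experts. Verify argument names against your installed version.
    model_dim, hidden_size = 1024, 4096
    moe = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},
        model_dim=model_dim,
        experts={
            'type': 'ffn',
            'count_per_node': 2,
            'hidden_size_per_expert': hidden_size,
            'activation_fn': lambda t: F.relu(t),
        },
    )

    x = torch.randn(4, 512, model_dim)  # [batch, seq_len, model_dim]
    y = moe(x)                          # output keeps the input shape
    print(y.shape)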

Highlighted Details

  • First to support DeepSeek FP4 inference on A100/H100/MI300 hardware.
  • Achieves significantly higher decode TPS (tokens per second) than TRT-LLM and SGLang on MI300X for FP4 models.
  • Offers dynamic switching for parallelism, sparsity, and capacity with minimal overhead (see the bookkeeping sketch after this list).
  • Integrates Megablocks for improved decoder inference on single GPUs.
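
To make the switching trade-off concrete, the sketch below uses standard MoE bookkeeping (generic arithmetic, not Tutel's internals): raising top-k or the capacity factor proportionally grows per-expert work and all-to-all traffic, which is what dynamic switching trades off at run time.

    # Generic MoE bookkeeping (not Tutel-specific): how switching sparsity (top-k)
    # or the capacity factor changes per-expert work and all-to-all volume.
    def moe_budget(num_tokens, model_dim, num_experts, k, capacity_factor):
        per_expert_capacity = int(capacity_factor * num_tokens * k / num_experts)
        # Dispatch sends each token's embedding to k experts; combine sends it back.
        alltoall_elements = 2 * num_tokens * k * model_dim
        return per_expert_capacity, alltoall_elements

    for k in (1, 2):  # e.g. switching from top-1 to top-2 routing
        cap, vol = moe_budget(num_tokens=4096, model_dim=1024, num_experts=8,
                              k=k, capacity_factor=1.25)
        print(f"top-{k}: per-expert capacity={cap}, all-to-all elements={vol}")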

Maintenance & Community

  • Actively developed by Microsoft.
  • Contributions are welcomed via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Although multiple hardware platforms are supported, specific optimizations may be more mature for NVIDIA GPUs.
  • The README mentions support for ROCm but provides limited examples for AMD hardware compared to NVIDIA.
Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 4

Star History

  • 24 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.3%
2k
Attention kernel for plug-and-play inference acceleration
Created 11 months ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
20k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago