Tutel by microsoft

Optimized MoE library for modern training and inference

Created 4 years ago
934 stars

Top 39.1% on SourcePulse

View on GitHub
Project Summary

Tutel is an optimized Mixture-of-Experts (MoE) library for PyTorch, designed to enhance training and inference performance for large language models. It offers advanced parallelism and sparsity features, targeting researchers and engineers working with large-scale AI models, particularly those leveraging advanced hardware like NVIDIA A100/H100 and AMD MI300.

How It Works

Tutel implements a "No-penalty Parallelism/Sparsity/Capacity Switching" scheme that lets MoE configurations, such as the expert-parallelism layout, gating top-k, and expert capacity, be adjusted dynamically without performance degradation. It also optimizes communication primitives such as all-to-all dispatch and supports advanced quantization formats, including DeepSeek FP8 and FP4, to maximize hardware utilization and throughput.
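
To make the MoE layer concrete, here is a minimal single-device sketch based on the moe_layer interface shown in the repository's README; argument names and defaults are assumptions that should be verified against the installed version:

    # Minimal sketch of constructing and calling a Tutel MoE layer
    # (argument names follow the repository README; verify against the
    # installed version before relying on them).
    import torch
    import torch.nn.functional as F
    from tutel import moe as tutel_moe

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.randn(8, 1024, device=device)          # (tokens, model_dim)

    moe_layer = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},           # top-2 sparse gating
        model_dim=x.shape[-1],
        experts={
            'type': 'ffn',
            'count_per_node': 2,                     # local experts on this device
            'hidden_size_per_expert': 2048,
            'activation_fn': lambda t: F.relu(t),
        },
    ).to(device)

    y = moe_layer(x)                                 # tokens routed to the selected experts
    print(y.shape)                                   # same leading shape as the input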

Quick Start & Requirements

  • Installation: pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main or build from source; a brief post-install check is sketched after this list.
  • Prerequisites: PyTorch (>= 1.10 recommended), CUDA (>= 11.7) or ROCm (>= 6.2.2).
  • Resources: Supports multi-GPU setups (NVIDIA/AMD) and CPU.
  • Docs: https://github.com/microsoft/tutel
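
A quick post-install check (a hypothetical snippet, not taken from the README) that confirms the package imports and reports whether a GPU is visible:

    # Hypothetical smoke test: verify the tutel package imports and report
    # whether PyTorch can see a CUDA/ROCm device.
    import torch
    from tutel import moe as tutel_moe

    print("moe_layer available:", hasattr(tutel_moe, "moe_layer"))
    print("GPU visible to PyTorch:", torch.cuda.is_available())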

Highlighted Details

  • First to support DeepSeek FP4 inference on A100/H100/MI300 hardware.
  • Achieves significantly higher decode throughput (tokens per second) than TRT-LLM and SGLang on MI300X for FP4 models.
  • Offers dynamic switching of parallelism, sparsity, and capacity with minimal overhead (see the gate-configuration sketch after this list).
  • Integrates Megablocks for improved decoder inference on single GPUs.
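
As a rough illustration of where sparsity and capacity live in the configuration, a hedged sketch of gate settings; the 'capacity_factor' key and its zero-means-dynamic convention are assumptions drawn from the README and should be verified against the installed version:

    # Sketch only: the gate dictionary passed as gate_type to tutel_moe.moe_layer
    # is where sparsity (top-k) and capacity are expressed. 'capacity_factor' = 0
    # is assumed to request dynamic capacity; a positive value is assumed to pin
    # a static capacity. Verify against the installed version before use.
    sparse_gate = {'type': 'top', 'k': 1, 'capacity_factor': 0}     # top-1 routing, dynamic capacity
    static_gate = {'type': 'top', 'k': 2, 'capacity_factor': 1.25}  # top-2 routing, fixed capacity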

Maintenance & Community

  • Actively developed by Microsoft.
  • Contributions are welcomed via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • While it supports a range of hardware, some optimizations may be more mature on NVIDIA GPUs than on AMD.
  • The README mentions support for ROCm but provides limited examples for AMD hardware compared to NVIDIA.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.1% · 3k stars
Attention kernel for plug-and-play inference acceleration
Created 1 year ago · Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 18 hours ago