Triton-distributed by ByteDance-Seed

Distributed compiler for computation-communication overlapping, based on Triton

Created 9 months ago
1,309 stars

Top 30.3% on SourcePulse

View on GitHub
Project Summary

Triton-distributed is a distributed compiler based on OpenAI Triton, designed to create efficient kernels for parallel systems by overlapping computation and communication. It targets researchers and engineers developing high-performance distributed AI models, offering primitives to simplify the creation of complex communication patterns and achieve performance comparable to or better than hand-tuned libraries.

How It Works

The project extends OpenAI's Triton with a set of low-level primitives for distributed programming. These primitives abstract the communication side of operations such as AllToAll dispatch and distributed GEMM, so developers can focus on how computation and communication overlap. The design goal is to let programmers write kernels that match or exceed the performance of specialized hand-tuned libraries while exploiting hardware interconnects such as NVLink and InfiniBand.
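Triton-distributed's own primitives are not reproduced here; as a rough, framework-agnostic illustration of the overlap idea it targets, the sketch below pipelines chunked all-gather communication against per-chunk matmuls using two CUDA streams in plain PyTorch. All names are illustrative assumptions and are not part of the Triton-distributed API.

```python
# Conceptual illustration of computation-communication overlap (NOT the
# Triton-distributed API): the gather for chunk i+1 runs on a separate CUDA
# stream while the matmul for chunk i runs on the default stream.
import torch
import torch.distributed as dist

def overlapped_allgather_matmul(x_local, weight, num_chunks=4):
    """All-gather `x_local` across ranks chunk by chunk and multiply each
    gathered chunk by `weight`, overlapping the next gather with the
    current matmul. Assumes torch.distributed is initialized with NCCL."""
    world = dist.get_world_size()
    comm_stream = torch.cuda.Stream()
    gathered, events = [], []

    # Issue all gathers on the communication stream.
    for c in x_local.chunk(num_chunks, dim=0):
        with torch.cuda.stream(comm_stream):
            buf = torch.empty(world * c.shape[0], c.shape[1],
                              device=c.device, dtype=c.dtype)
            dist.all_gather_into_tensor(buf, c.contiguous())
            ev = torch.cuda.Event()
            ev.record(comm_stream)
        gathered.append(buf)
        events.append(ev)

    # Compute on the default stream as soon as each chunk's gather finishes,
    # so later gathers overlap with earlier matmuls.
    outputs = []
    for buf, ev in zip(gathered, events):
        torch.cuda.current_stream().wait_event(ev)
        outputs.append(buf @ weight)
    return torch.cat(outputs, dim=0)
```

Triton-distributed moves this kind of scheduling into the kernel itself via its primitives, rather than relying on host-side streams as this sketch does.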

Quick Start & Requirements

  • Install from source.
  • Requires NVIDIA GPUs (SM80, SM89, SM90a) or AMD GPUs (CDNA3); a quick capability-check sketch follows this list.
  • Supports NVLink and InfiniBand for communication.
  • See Build Guide for detailed instructions.
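Before building from source, it can help to confirm the local GPU matches one of the supported NVIDIA architectures. The check below maps the listed targets (SM80, SM89, SM90a) to compute capabilities via PyTorch; it is only an illustrative pre-flight check, not part of the project's build scripts, and it does not cover the AMD CDNA3 path.

```python
# Illustrative pre-flight check (not part of the project's build scripts):
# verify the local NVIDIA GPU is one of the architectures listed above.
import torch

SUPPORTED_NVIDIA_CC = {(8, 0), (8, 9), (9, 0)}  # SM80, SM89, SM90a

def check_gpu_support() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible; a supported GPU is required.")
    cc = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if cc in SUPPORTED_NVIDIA_CC:
        print(f"{name} (sm_{cc[0]}{cc[1]}) is a documented NVIDIA target.")
    else:
        print(f"{name} (sm_{cc[0]}{cc[1]}) is not in the documented NVIDIA target "
              "list; see the Build Guide, including the AMD CDNA3 instructions.")

if __name__ == "__main__":
    check_gpu_support()
```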

Highlighted Details

  • Achieves 137us for AllToAll on 32 H800 GPUs (128 tokens/rank, fp8), outperforming DeepEP.
  • Provides example implementations for distributed GEMM, MoE, and Flash-Decoding kernels.
  • Supports cross-node communication and computation-communication overlapping kernels.
  • Includes performance benchmarks on NVIDIA H800 GPUs and scaling analysis; a generic GPU-timing sketch follows this list.
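The benchmark figures above are GPU-side timings. The harness below is an illustrative, generic pattern for that kind of measurement (warmup iterations plus CUDA-event timing); it is not the project's benchmark code, and `overlapped_impl` / `baseline_impl` in the usage comment are hypothetical placeholders.

```python
# Generic CUDA-event timing harness (illustrative; not the project's benchmark code).
import torch

def bench_gpu(fn, warmup: int = 10, iters: int = 50) -> float:
    """Return the mean GPU time of `fn()` in microseconds."""
    for _ in range(warmup):          # warm up caches, autotuning, lazy init
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # elapsed_time is ms; convert to us/iter

# Hypothetical usage: compare an overlapped implementation against a baseline.
# t_overlap = bench_gpu(lambda: overlapped_impl(inputs))
# t_base    = bench_gpu(lambda: baseline_impl(inputs))
```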

Maintenance & Community

  • Developed by the ByteDance Seed Team.
  • Contributions via issues and pull requests.
  • No explicit community channels (Discord/Slack) mentioned.

Licensing & Compatibility

  • MIT License for the core project.
  • Apache-2.0 License for specific kernel implementations (e.g., flash_decode.py) and parts of Triton's original code.
  • Compatible with commercial use, but note the dual licensing for specific components.

Limitations & Caveats

  • Currently only low-level primitives are released; high-level primitives and tutorials are planned.
  • Pre-built binaries are not yet available.
  • PCIe communication backend is not yet supported.
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 44 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

Created 11 months ago · Updated 5 days ago
6k stars

Top 0.4% on SourcePulse