Triton-distributed by ByteDance-Seed

Distributed compiler for computation-communication overlapping, based on Triton

Created 9 months ago
1,309 stars

Top 30.3% on SourcePulse

View on GitHub
Project Summary

Triton-distributed is a distributed compiler based on OpenAI Triton, designed to create efficient kernels for parallel systems by overlapping computation and communication. It targets researchers and engineers developing high-performance distributed AI models, offering primitives to simplify the creation of complex communication patterns and achieve performance comparable to or better than hand-tuned libraries.

How It Works

The project extends OpenAI's Triton with a set of low-level primitives for distributed programming. These primitives abstract the communication side of operations such as AllToAll dispatch and distributed GEMM, so developers can focus on how computation and communication overlap. The design goal is to let programmers write kernels that match or exceed the performance of specialized hand-tuned libraries while exploiting hardware interconnects such as NVLink and InfiniBand.
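Triton-distributed's own primitives are not reproduced here; as a rough, framework-agnostic illustration of the overlap idea it targets, the sketch below pipelines chunked all-gather communication against per-chunk matmuls using two CUDA streams in plain PyTorch. All names are illustrative assumptions and are not part of the Triton-distributed API.

```python
# Conceptual illustration of computation-communication overlap (NOT the
# Triton-distributed API): the gather for chunk i+1 runs on a separate CUDA
# stream while the matmul for chunk i runs on the default stream.
import torch
import torch.distributed as dist

def overlapped_allgather_matmul(x_local, weight, num_chunks=4):
    """All-gather `x_local` across ranks chunk by chunk and multiply each
    gathered chunk by `weight`, overlapping the next gather with the
    current matmul. Assumes torch.distributed is initialized with NCCL."""
    world = dist.get_world_size()
    comm_stream = torch.cuda.Stream()
    gathered, events = [], []

    # Issue all gathers on the communication stream.
    for c in x_local.chunk(num_chunks, dim=0):
        with torch.cuda.stream(comm_stream):
            buf = torch.empty(world * c.shape[0], c.shape[1],
                              device=c.device, dtype=c.dtype)
            dist.all_gather_into_tensor(buf, c.contiguous())
            ev = torch.cuda.Event()
            ev.record(comm_stream)
        gathered.append(buf)
        events.append(ev)

    # Compute on the default stream as soon as each chunk's gather finishes,
    # so later gathers overlap with earlier matmuls.
    outputs = []
    for buf, ev in zip(gathered, events):
        torch.cuda.current_stream().wait_event(ev)
        outputs.append(buf @ weight)
    return torch.cat(outputs, dim=0)
```

Triton-distributed moves this kind of scheduling into the kernel itself via its primitives, rather than relying on host-side streams as this sketch does.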

Quick Start & Requirements

  • Install from source.
  • Requires NVIDIA GPUs (SM80, SM89, SM90a) or AMD GPUs (CDNA3); a quick capability-check sketch follows this list.
  • Supports NVLink and InfiniBand for communication.
  • See Build Guide for detailed instructions.
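Before building from source, it can help to confirm the local GPU matches one of the supported NVIDIA architectures. The check below maps the listed targets (SM80, SM89, SM90a) to compute capabilities via PyTorch; it is only an illustrative pre-flight check, not part of the project's build scripts, and it does not cover the AMD CDNA3 path.

```python
# Illustrative pre-flight check (not part of the project's build scripts):
# verify the local NVIDIA GPU is one of the architectures listed above.
import torch

SUPPORTED_NVIDIA_CC = {(8, 0), (8, 9), (9, 0)}  # SM80, SM89, SM90a

def check_gpu_support() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible; a supported GPU is required.")
    cc = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if cc in SUPPORTED_NVIDIA_CC:
        print(f"{name} (sm_{cc[0]}{cc[1]}) is a documented NVIDIA target.")
    else:
        print(f"{name} (sm_{cc[0]}{cc[1]}) is not in the documented NVIDIA target "
              "list; see the Build Guide, including the AMD CDNA3 instructions.")

if __name__ == "__main__":
    check_gpu_support()
```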

Highlighted Details

  • Achieves 137us for AllToAll on 32 H800 GPUs (128 tokens/rank, fp8), outperforming DeepEP.
  • Provides example implementations for distributed GEMM, MoE, and Flash-Decoding kernels.
  • Supports cross-node communication and computation-communication overlapping kernels.
  • Includes performance benchmarks on NVIDIA H800 GPUs and scaling analysis; a generic GPU-timing sketch follows this list.
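The benchmark figures above are GPU-side timings. The harness below is an illustrative, generic pattern for that kind of measurement (warmup iterations plus CUDA-event timing); it is not the project's benchmark code, and `overlapped_impl` / `baseline_impl` in the usage comment are hypothetical placeholders.

```python
# Generic CUDA-event timing harness (illustrative; not the project's benchmark code).
import torch

def bench_gpu(fn, warmup: int = 10, iters: int = 50) -> float:
    """Return the mean GPU time of `fn()` in microseconds."""
    for _ in range(warmup):          # warm up caches, autotuning, lazy init
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # elapsed_time is ms; convert to us/iter

# Hypothetical usage: compare an overlapped implementation against a baseline.
# t_overlap = bench_gpu(lambda: overlapped_impl(inputs))
# t_base    = bench_gpu(lambda: baseline_impl(inputs))
```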

Maintenance & Community

  • Developed by the ByteDance Seed Team.
  • Contributions via issues and pull requests.
  • No explicit community channels (Discord/Slack) mentioned.

Licensing & Compatibility

  • MIT License for the core project.
  • Apache-2.0 License for specific kernel implementations (e.g., flash_decode.py) and parts of Triton's original code.
  • Compatible with commercial use, but note the dual licensing for specific components.

Limitations & Caveats

  • Currently only low-level primitives are released; high-level primitives and tutorials are planned.
  • Pre-built binaries are not yet available.
  • PCIe communication backend is not yet supported.
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 44 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

Created 11 months ago · Updated 5 days ago
6k stars

Top 0.4% on SourcePulse