Triton-distributed by ByteDance-Seed

Distributed compiler for computation-communication overlapping, based on Triton

Created 5 months ago
1,122 stars

Top 34.2% on SourcePulse

View on GitHub
Project Summary

Triton-distributed is a distributed compiler based on OpenAI Triton, designed to create efficient kernels for parallel systems by overlapping computation and communication. It targets researchers and engineers developing high-performance distributed AI models, offering primitives to simplify the creation of complex communication patterns and achieve performance comparable to or better than hand-tuned libraries.

How It Works

The project extends OpenAI's Triton with a set of low-level primitives for distributed programming. These primitives abstract complex communication patterns such as AllToAll, letting developers focus on overlapping communication with computation such as GEMM rather than on hand-coordinating the two. The design goal is for programmers to write kernels that match or exceed the performance of specialized hand-tuned libraries while exploiting hardware interconnects like NVLink and InfiniBand.
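For intuition, the sketch below expresses the same overlap idea at the coarse-grained stream level with plain PyTorch. It does not use Triton-distributed's primitives (which move the overlap inside the kernel); function and tensor names here are illustrative only.

```python
import torch
import torch.distributed as dist

def overlapped_step(local_chunk: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Overlap an AllToAll token exchange with a GEMM on already-local data."""
    recv_buf = torch.empty_like(local_chunk)

    # Launch the collective asynchronously so it proceeds concurrently with
    # the matmul issued on the default stream below.
    handle = dist.all_to_all_single(recv_buf, local_chunk, async_op=True)

    # Keep the GPU busy with computation on data that is already resident.
    partial = local_chunk @ weight

    handle.wait()                        # do not read recv_buf before it arrives
    return partial + recv_buf @ weight   # consume the exchanged tokens
```

Triton-distributed targets a finer granularity than this operator-level pattern, hiding communication latency behind individual tiles of the computation within a kernel.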

Quick Start & Requirements

  • Install from source.
  • Requires NVIDIA GPUs (SM80, SM89, SM90a) or AMD GPUs (CDNA3); a quick capability check is sketched after this list.
  • Supports NVLink and InfiniBand for communication.
  • See Build Guide for detailed instructions.
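As a small aid (not part of the project), the snippet below, assuming PyTorch is installed, checks whether the local NVIDIA GPU's compute capability is in the supported range before you build from source; AMD CDNA3 devices would need a separate ROCm-side check.

```python
import torch

# Compute capabilities listed as supported: SM80, SM89, SM90a.
SUPPORTED_NVIDIA_CC = {(8, 0), (8, 9), (9, 0)}

def check_gpu() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible; a supported NVIDIA or AMD GPU is required.")
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if (major, minor) in SUPPORTED_NVIDIA_CC:
        print(f"{name} (SM{major}{minor}) is in the supported range.")
    else:
        print(f"{name} (SM{major}{minor}) is not listed as supported; see the Build Guide.")

if __name__ == "__main__":
    check_gpu()
```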

Highlighted Details

  • Achieves 137 µs for AllToAll on 32 H800 GPUs (128 tokens per rank, fp8), outperforming DeepEP.
  • Provides example implementations for distributed GEMM, MoE, and Flash-Decoding kernels.
  • Supports cross-node communication and computation-communication overlapping kernels.
  • Includes performance benchmarks on NVIDIA H800 GPUs and scaling analysis.

Maintenance & Community

  • Developed by the ByteDance Seed Team.
  • Contributions via issues and pull requests.
  • No explicit community channels (Discord/Slack) mentioned.

Licensing & Compatibility

  • MIT License for the core project.
  • Apache-2.0 License for specific kernel implementations (e.g., flash_decode.py) and parts of Triton's original code.
  • Compatible with commercial use, but note the dual licensing for specific components.

Limitations & Caveats

  • Currently only low-level primitives are released; high-level primitives and tutorials are planned.
  • Pre-built binaries are not yet available.
  • PCIe communication backend is not yet supported.

Health Check

  • Last commit: 20 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 10
  • Issues (30d): 13
  • Star history: 87 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4% · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

0.6% · 6k stars
Triton kernels for efficient LLM training
Created 1 year ago · Updated 1 day ago