Triton-distributed by ByteDance-Seed

Distributed compiler for computation-communication overlapping, based on Triton

created 4 months ago
930 stars

Top 40.1% on sourcepulse

Project Summary

Triton-distributed is a distributed compiler based on OpenAI Triton, designed to create efficient kernels for parallel systems by overlapping computation and communication. It targets researchers and engineers developing high-performance distributed AI models, offering primitives to simplify the creation of complex communication patterns and achieve performance comparable to or better than hand-tuned libraries.

How It Works

The project extends OpenAI's Triton with a set of low-level primitives for distributed programming. These primitives abstract communication operations such as AllToAll, which developers combine with compute kernels such as GEMM to express computation-communication overlap directly in Triton. The design goal is to let programmers write kernels that match or exceed the performance of specialized hand-tuned libraries while leveraging hardware interconnects like NVLink and InfiniBand.
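
To give a flavor of the programming model, the sketch below shows the wait-then-compute pattern that such primitives build on: each program instance spins on a per-tile flag that a producer (for example, a communication kernel or a remote rank) sets once that tile's input has landed, so compute for early tiles overlaps with communication for later ones. The raw atomic spin and the flag layout are illustrative stand-ins, not Triton-distributed's actual signal API.

```python
import triton
import triton.language as tl


@triton.jit
def consumer_tile_kernel(x_ptr, y_ptr, flag_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # Spin until the producer marks this tile's input as ready (flag set to 1).
    # Triton-distributed ships dedicated signal primitives for this; a plain
    # atomic read loop keeps the sketch self-contained.
    while tl.atomic_add(flag_ptr + pid, 0) == 0:
        pass
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x * 2.0, mask=mask)  # stand-in for real compute
```

On the host, `flag_ptr` would point at a zero-initialized int32 tensor that the producer sets as each tile's data arrives.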

Quick Start & Requirements

  • Install from source.
  • Requires NVIDIA GPUs (SM80, SM89, SM90a) or AMD GPUs (CDNA3); a quick capability check is sketched after this list.
  • Supports NVLink and InfiniBand for communication.
  • See Build Guide for detailed instructions.
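
Before building from source, it may save time to confirm your GPUs fall in the documented range. This check uses only standard PyTorch calls (nothing from Triton-distributed) and covers the NVIDIA side; AMD CDNA3 detection under ROCm is not shown.

```python
import torch

# NVIDIA compute capabilities the project lists as supported:
# SM80 (A100), SM89 (Ada), SM90a (H100/H800, reported as (9, 0)).
SUPPORTED_NVIDIA = {(8, 0), (8, 9), (9, 0)}

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        ok = "supported" if cap in SUPPORTED_NVIDIA else "not listed as supported"
        print(f"GPU {i}: {name} (sm{cap[0]}{cap[1]}) -> {ok}")
else:
    print("No CUDA devices visible.")
```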

Highlighted Details

  • Achieves 137 µs for AllToAll on 32 H800 GPUs (128 tokens/rank, fp8), outperforming DeepEP.
  • Provides example implementations for distributed GEMM, MoE, and Flash-Decoding kernels.
  • Supports cross-node communication and computation-communication overlapping kernels (a coarse-grained analogue is sketched after this list).
  • Includes performance benchmarks on NVIDIA H800 GPUs and scaling analysis.
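
Triton-distributed performs this overlap at tile granularity inside a single fused kernel. For intuition only, a much coarser analogue can be written in plain PyTorch with CUDA streams, with a device copy standing in for a network transfer; this illustrates what is being overlapped, not how the project implements it.

```python
import torch

dev = "cuda"
chunks = [torch.randn(1024, 1024, device=dev) for _ in range(4)]
weight = torch.randn(1024, 1024, device=dev)
staged = [torch.empty_like(c) for c in chunks]

# Enqueue all "communication" (device copies) on a side stream, recording an
# event per chunk so compute can start as soon as its own input is ready.
comm = torch.cuda.Stream()
events = []
for src, dst in zip(chunks, staged):
    with torch.cuda.stream(comm):
        dst.copy_(src)
        ev = torch.cuda.Event()
        ev.record(comm)
        events.append(ev)

# Each matmul waits only on its own chunk's event, so early chunks compute
# while later copies are still in flight.
outs = []
for dst, ev in zip(staged, events):
    torch.cuda.current_stream().wait_event(ev)
    outs.append(dst @ weight)
torch.cuda.synchronize()
```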

Maintenance & Community

  • Developed by the ByteDance Seed Team.
  • Contributions via issues and pull requests.
  • No explicit community channels (Discord/Slack) mentioned.

Licensing & Compatibility

  • MIT License for the core project.
  • Apache-2.0 License for specific kernel implementations (e.g., flash_decode.py) and parts of Triton's original code.
  • Compatible with commercial use, but note the mixed MIT/Apache-2.0 licensing across components.

Limitations & Caveats

  • Currently only low-level primitives are released; high-level primitives and tutorials are planned.
  • Pre-built binaries are not yet available.
  • PCIe communication backend is not yet supported.
Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 23
  • Issues (30d): 17

Star History

  • 298 stars in the last 90 days

Explore Similar Projects


  • Liger-Kernel by linkedin: Triton kernels for efficient LLM training (5k stars, created 1 year ago, updated 1 day ago).

  • tinygrad by tinygrad: Minimalist deep learning framework for education and exploration (30k stars, created 4 years ago, updated 19 hours ago).