TileRT by tile-ai

Ultra-low-latency LLM inference runtime

Created 5 months ago
714 stars

Top 47.6% on SourcePulse

Project Summary

TileRT addresses the challenge of ultra-low-latency inference for Large Language Models (LLMs), targeting applications demanding extreme responsiveness like interactive AI and real-time decision-making. It offers a novel tile-based runtime engine designed to push LLM latency boundaries, enabling large models to achieve millisecond-level response times without compromising quality.

How It Works

The core innovation is a tile-level runtime engine employing a compiler-driven approach. LLM operators are decomposed into fine-grained tile-level tasks. The runtime then meticulously reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This strategy minimizes hardware idle time and maximizes utilization, leading to significant latency reductions.
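The overlap idea can be illustrated with a toy software pipeline: prefetch the next tile's data while computing the current one, so I/O time hides behind compute. This is a minimal sketch of the general technique, not TileRT's actual API; all function names and timings below are invented for illustration.

```python
# Toy illustration of tile-level compute/I/O overlap (NOT TileRT's API).
# Each "tile" needs an I/O step (weight fetch) and a compute step; a
# two-stage software pipeline prefetches tile i+1 while computing tile i.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(i):            # stand-in for weight / KV-cache I/O
    time.sleep(0.05)
    return f"weights_{i}"

def compute_tile(w):          # stand-in for the tile's GEMM work
    time.sleep(0.05)
    return f"out_from_{w}"

def run_serial(n):
    # Baseline: I/O and compute strictly alternate, nothing overlaps.
    return [compute_tile(fetch_tile(i)) for i in range(n)]

def run_overlapped(n):
    # Pipelined: a background worker fetches tile i+1 while the main
    # thread computes tile i, so each fetch hides behind a compute.
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(fetch_tile, 0)
        for i in range(n):
            w = nxt.result()
            if i + 1 < n:
                nxt = io.submit(fetch_tile, i + 1)
            out.append(compute_tile(w))
    return out

if __name__ == "__main__":
    n = 4
    t0 = time.perf_counter(); run_serial(n);     t_serial = time.perf_counter() - t0
    t0 = time.perf_counter(); run_overlapped(n); t_pipe   = time.perf_counter() - t0
    print(f"serial {t_serial:.2f}s  overlapped {t_pipe:.2f}s")
```

TileRT generalizes this far beyond a two-stage pipeline, rescheduling fine-grained tile tasks across compute, I/O, and inter-GPU communication, but the latency win comes from the same principle: keep every hardware unit busy by overlapping work that a naive operator-by-operator schedule would serialize.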

Quick Start & Requirements

Installation is recommended via Docker (tileai/tilert:v0.1.0). Prerequisites include:

  • Hardware: 8 NVIDIA B200 GPUs.
  • OS: Linux x86_64 (Ubuntu 20.04+).
  • Python: 3.11 – 3.12.
  • PyTorch: build compiled for CUDA 12.8 or 12.9.
  • Weights: pre-converted model weights (e.g., Tile-AI/DeepSeek-V3.2-Exp-TileRT) downloaded from HuggingFace.
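The setup steps above might look like the following. The image tag and model repository name come from this summary; the mount path and container flags are assumptions, so check the project's README for the authoritative invocation.

```shell
# Pull the recommended Docker image (tag from the summary above).
docker pull tileai/tilert:v0.1.0

# Fetch the pre-converted weights from HuggingFace (repo name from the
# summary; target directory is an assumption).
pip install -U "huggingface_hub[cli]"
huggingface-cli download Tile-AI/DeepSeek-V3.2-Exp-TileRT \
  --local-dir ./DeepSeek-V3.2-Exp-TileRT

# Launch the container with all 8 B200 GPUs visible and the weights
# mounted; the mount point inside the container is hypothetical.
docker run --gpus all -it \
  -v "$PWD/DeepSeek-V3.2-Exp-TileRT:/models/DeepSeek-V3.2-Exp-TileRT" \
  tileai/tilert:v0.1.0
```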

Highlighted Details

  • Aims for millisecond-level time per output token (TPOT) for models with hundreds of billions of parameters.
  • Preliminary evaluation on DeepSeek-V3.2-Exp (batch size 1, 1K/1K sequence length) on 8x NVIDIA B200 GPUs showed TileRT significantly outperforming SGLang and vLLM.
  • Evaluations were conducted without lossy optimizations like quantization or distillation.

Maintenance & Community

TileRT is presented as an experimental, continuously evolving project with a preview release. Future updates are anticipated to enhance performance and expand support. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the README.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for any adoption decision, particularly regarding commercial use or integration into closed-source projects. Compatibility is currently limited to specific Linux environments with high-end NVIDIA hardware and CUDA versions.

Limitations & Caveats

As an experimental preview, TileRT has notable limitations: the current build supports only an 8-GPU B200 setup, and users must adhere to specific hardware, OS, and CUDA version requirements. Installing via the provided Docker image is strongly advised for stability.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
31 stars in the last 30 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai) and Carol Willing (core contributor to CPython, Jupyter).

ai-performance-engineering by cfregly

1k stars
AI Systems Performance Engineering for modern AI workloads
Created 1 year ago · Updated 4 weeks ago