TileRT by tile-ai

Ultra-low-latency LLM inference runtime

Created 2 weeks ago

New!

261 stars

Top 97.5% on SourcePulse

Project Summary

TileRT addresses the challenge of ultra-low-latency inference for Large Language Models (LLMs), targeting applications demanding extreme responsiveness like interactive AI and real-time decision-making. It offers a novel tile-based runtime engine designed to push LLM latency boundaries, enabling large models to achieve millisecond-level response times without compromising quality.

How It Works

The core innovation is a tile-level runtime engine employing a compiler-driven approach. LLM operators are decomposed into fine-grained tile-level tasks. The runtime then meticulously reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This strategy minimizes hardware idle time and maximizes utilization, leading to significant latency reductions.
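TileRT's real engine is compiler-generated and operates at the device level, but the overlap idea can be illustrated with a toy Python sketch (hypothetical, not TileRT's actual API): an operator is split into fixed-size tiles, and a loader thread stages the next tile while the current one is being computed, so I/O and compute run concurrently instead of back to back.

```python
import threading
import queue

# Toy model of tile-level scheduling (illustrative only, not TileRT code):
# a loader thread stages tiles (standing in for I/O or device transfers)
# while the main thread computes on tiles already staged, overlapping the
# two phases instead of running them sequentially.

TILE = 4

def run_tiled(data):
    staged = queue.Queue(maxsize=2)  # small buffer bounds staging memory

    def loader():
        for i in range(0, len(data), TILE):
            staged.put(data[i:i + TILE])  # "I/O": stage the next tile
        staged.put(None)                  # sentinel: no more tiles

    threading.Thread(target=loader, daemon=True).start()

    out = []
    while (tile := staged.get()) is not None:
        out.extend(x * x for x in tile)   # "compute": square each element
    return out

print(run_tiled(list(range(10))))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

In the real system the same principle applies across GPUs: fine-grained tile tasks give the scheduler many small units it can interleave, which is what keeps hardware from idling between coarse operator boundaries.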

Quick Start & Requirements

Installation is recommended via Docker (tileai/tilert:v0.1.0). Prerequisites include:

  • Hardware: 8× NVIDIA B200 GPUs.
  • OS: Linux x86_64 (Ubuntu 20.04+).
  • Python: 3.11 – 3.12.
  • PyTorch: built for CUDA 12.8 or 12.9.

Pre-converted model weights (e.g., Tile-AI/DeepSeek-V3.2-Exp-TileRT) must be downloaded from Hugging Face.

Highlighted Details

  • Aims for millisecond-level time per output token (TPOT) for models with hundreds of billions of parameters.
  • A preliminary evaluation of DeepSeek-V3.2-Exp (batch size 1, 1K input / 1K output sequence length) on 8× NVIDIA B200 GPUs showed TileRT significantly outperforming SGLang and vLLM.
  • Evaluations were conducted without lossy optimizations like quantization or distillation.
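For intuition, TPOT is simply the average decode time per generated token after the first (prefill, which produces the first token, is excluded). The numbers below are hypothetical examples, not measured TileRT results.

```python
# TPOT (time per output token): average decode time per generated token,
# excluding the prefill phase that produces the first token.
# Hypothetical numbers only, not measured TileRT results.

def tpot_ms(total_decode_time_s, output_tokens):
    """Average milliseconds per generated token after the first."""
    return total_decode_time_s / (output_tokens - 1) * 1000.0

# e.g. 1,001 output tokens whose decode phase took 5 s -> 5 ms per token
print(round(tpot_ms(5.0, 1001), 2))  # 5.0
```

"Millisecond-level TPOT" thus means each successive token arrives within a few milliseconds of the previous one, which is what makes the output feel instantaneous in interactive use.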

Maintenance & Community

TileRT is presented as an experimental, continuously evolving project with a preview release. Future updates are anticipated to enhance performance and expand support. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the README.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for any adoption decision, particularly regarding commercial use or integration into closed-source projects. Compatibility is currently limited to specific Linux environments with high-end NVIDIA hardware and CUDA versions.

Limitations & Caveats

As an experimental preview, TileRT has limitations. The current build is restricted to an 8-GPU B200 setup. Users must adhere to specific hardware, OS, and CUDA version requirements. Installation within the provided Docker image is strongly advised for stability.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 2
  • Star History: 263 stars in the last 19 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

0.2% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 3 days ago