TileRT by tile-ai

Ultra-low-latency LLM inference runtime

Created 5 months ago
714 stars

Top 47.6% on SourcePulse

Project Summary

TileRT addresses the challenge of ultra-low-latency inference for Large Language Models (LLMs), targeting applications demanding extreme responsiveness like interactive AI and real-time decision-making. It offers a novel tile-based runtime engine designed to push LLM latency boundaries, enabling large models to achieve millisecond-level response times without compromising quality.

How It Works

The core innovation is a tile-level runtime engine employing a compiler-driven approach. LLM operators are decomposed into fine-grained tile-level tasks. The runtime then meticulously reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This strategy minimizes hardware idle time and maximizes utilization, leading to significant latency reductions.
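The overlap idea can be illustrated with a toy software pipeline: prefetch the next tile's data while computing the current one, so I/O time hides behind compute. This is a minimal sketch of the general technique, not TileRT's actual API; all function names and timings below are invented for illustration.

```python
# Toy illustration of tile-level compute/I/O overlap (NOT TileRT's API).
# Each "tile" needs an I/O step (weight fetch) and a compute step; a
# two-stage software pipeline prefetches tile i+1 while computing tile i.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(i):            # stand-in for weight / KV-cache I/O
    time.sleep(0.05)
    return f"weights_{i}"

def compute_tile(w):          # stand-in for the tile's GEMM work
    time.sleep(0.05)
    return f"out_from_{w}"

def run_serial(n):
    # Baseline: I/O and compute strictly alternate, nothing overlaps.
    return [compute_tile(fetch_tile(i)) for i in range(n)]

def run_overlapped(n):
    # Pipelined: a background worker fetches tile i+1 while the main
    # thread computes tile i, so each fetch hides behind a compute.
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(fetch_tile, 0)
        for i in range(n):
            w = nxt.result()
            if i + 1 < n:
                nxt = io.submit(fetch_tile, i + 1)
            out.append(compute_tile(w))
    return out

if __name__ == "__main__":
    n = 4
    t0 = time.perf_counter(); run_serial(n);     t_serial = time.perf_counter() - t0
    t0 = time.perf_counter(); run_overlapped(n); t_pipe   = time.perf_counter() - t0
    print(f"serial {t_serial:.2f}s  overlapped {t_pipe:.2f}s")
```

TileRT generalizes this far beyond a two-stage pipeline, rescheduling fine-grained tile tasks across compute, I/O, and inter-GPU communication, but the latency win comes from the same principle: keep every hardware unit busy by overlapping work that a naive operator-by-operator schedule would serialize.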

Quick Start & Requirements

Installation is recommended via Docker (tileai/tilert:v0.1.0). Prerequisites include:

  • Hardware: 8 NVIDIA B200 GPUs.
  • OS: Linux x86_64 (Ubuntu 20.04+).
  • Python: 3.11 – 3.12.
  • PyTorch: build compiled for CUDA 12.8 or 12.9.
  • Weights: pre-converted model weights (e.g., Tile-AI/DeepSeek-V3.2-Exp-TileRT) downloaded from HuggingFace.
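The setup steps above might look like the following. The image tag and model repository name come from this summary; the mount path and container flags are assumptions, so check the project's README for the authoritative invocation.

```shell
# Pull the recommended Docker image (tag from the summary above).
docker pull tileai/tilert:v0.1.0

# Fetch the pre-converted weights from HuggingFace (repo name from the
# summary; target directory is an assumption).
pip install -U "huggingface_hub[cli]"
huggingface-cli download Tile-AI/DeepSeek-V3.2-Exp-TileRT \
  --local-dir ./DeepSeek-V3.2-Exp-TileRT

# Launch the container with all 8 B200 GPUs visible and the weights
# mounted; the mount point inside the container is hypothetical.
docker run --gpus all -it \
  -v "$PWD/DeepSeek-V3.2-Exp-TileRT:/models/DeepSeek-V3.2-Exp-TileRT" \
  tileai/tilert:v0.1.0
```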

Highlighted Details

  • Aims for millisecond-level time per output token (TPOT) for models with hundreds of billions of parameters.
  • Preliminary evaluation on DeepSeek-V3.2-Exp (batch size 1, 1K/1K sequence length) on 8x NVIDIA B200 GPUs showed TileRT significantly outperforming SGLang and vLLM.
  • Evaluations were conducted without lossy optimizations like quantization or distillation.

Maintenance & Community

TileRT is presented as an experimental, continuously evolving project with a preview release. Future updates are anticipated to enhance performance and expand support. Specific community channels (e.g., Discord, Slack) or a public roadmap are not detailed in the README.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for any adoption decision, particularly regarding commercial use or integration into closed-source projects. Compatibility is currently limited to specific Linux environments with high-end NVIDIA hardware and CUDA versions.

Limitations & Caveats

As an experimental preview, TileRT has notable limitations: the current build supports only an 8-GPU B200 setup, and users must adhere to specific hardware, OS, and CUDA version requirements. Installing via the provided Docker image is strongly advised for stability.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
31 stars in the last 30 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai) and Carol Willing (core contributor to CPython, Jupyter).

ai-performance-engineering by cfregly

1k stars
AI Systems Performance Engineering for modern AI workloads
Created 1 year ago · Updated 4 weeks ago