tensorrtllm_backend by triton-inference-server

Triton backend for serving TensorRT-LLM models

Created 2 years ago
912 stars

Project Summary

This repository provides the Triton Inference Server backend for TensorRT-LLM, enabling efficient serving of large language models. It targets developers and researchers needing high-performance LLM inference, offering features like in-flight batching and paged attention for optimized throughput and latency.

How It Works

The backend leverages TensorRT-LLM's optimized kernels and graph optimizations for LLM inference. It integrates with Triton's C++ backend API and supports in-flight batching (new requests join and finished requests leave a running batch between decoding iterations), paged attention for efficient KV-cache management, and multiple decoding strategies (Top-k, Top-p, Beam Search, Speculative Decoding). Together these improve GPU utilization and reduce memory overhead.
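
For orientation, here is a minimal client sketch for querying a deployed model over Triton's HTTP generate endpoint. It assumes a server running locally on the default HTTP port 8000 and a model named ensemble with the text_input / max_tokens / top_k / top_p / temperature parameter names used in this repository's examples; adjust the names to match your own model configuration.

```python
# Minimal client sketch (assumptions: Triton running locally on the default
# HTTP port 8000, a model named "ensemble", and the text_input/max_tokens/
# top_k/top_p/temperature parameter names from this repo's examples).
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,        # cap on generated tokens
    "top_k": 50,             # Top-k sampling
    "top_p": 0.9,            # Top-p (nucleus) sampling
    "temperature": 0.7,
}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()["text_output"])
```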

Quick Start & Requirements

  • Install/Run: Launch Triton with the TensorRT-LLM container (nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3); a readiness-check sketch follows this list.
  • Prerequisites: NVIDIA GPU, CUDA >= 12, Python 3.x, Triton Inference Server. TensorRT-LLM engines must be prepared for the specific model.
  • Setup: Requires building TensorRT-LLM engines, which can take significant time depending on the model size and complexity.
  • Docs: Triton Backend Repo, TensorRT-LLM Repo
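
Once the container is up, a quick way to confirm that the server and your model are live is Triton's standard HTTP client. This is a minimal sketch, assuming the tritonclient[http] package is installed, the default port 8000, and a model named ensemble (substitute your own model name).

```python
# Minimal readiness check using Triton's standard HTTP client.
# Assumptions: tritonclient[http] is installed, the server listens on the
# default port 8000, and the deployed model is named "ensemble".
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

if client.is_server_ready() and client.is_model_ready("ensemble"):
    print("Server and model are ready for inference.")
else:
    print("Server or model not ready -- check the Triton logs.")
```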

Highlighted Details

  • Supports in-flight batching and paged attention for efficient LLM serving.
  • Offers multiple decoding strategies including Top-k, Top-p, Beam Search, Medusa, ReDrafter, Lookahead, and Eagle.
  • Supports model parallelism (Tensor, Pipeline, and Expert Parallelism) and Multi-Instance GPU (MIG).
  • Provides benchmarking tools and Triton metrics for performance monitoring (a metrics-scraping sketch follows this list).
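
Triton exposes Prometheus-format metrics over HTTP. The sketch below assumes the default metrics port 8002 and filters for inference-related samples; the exact metric names available depend on the server version and enabled features.

```python
# Minimal sketch: scraping Triton's Prometheus metrics endpoint.
# Assumptions: default metrics port 8002; the nv_inference_* metric family
# shown here is an example and may vary with server version/configuration.
import requests

metrics_text = requests.get("http://localhost:8002/metrics").text

for line in metrics_text.splitlines():
    # Prometheus comment lines start with '#'; metric samples start with the
    # metric name, e.g. nv_inference_request_success{...} <value>.
    if line.startswith("nv_inference_"):
        print(line)
```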

Maintenance & Community

  • Developed and maintained by NVIDIA.
  • Community support and questions can be directed to the issues page.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

Preparing TensorRT-LLM engines is complex and time-consuming. Running orchestrator mode under Slurm may require additional configuration. Published performance numbers depend heavily on the specific GPU hardware used.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 6
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days
