tensorrtllm_backend by triton-inference-server

Triton backend for serving TensorRT-LLM models

Created 2 years ago
889 stars

Top 40.7% on SourcePulse

Project Summary

This repository provides the Triton Inference Server backend for TensorRT-LLM, enabling efficient serving of large language models. It targets developers and researchers needing high-performance LLM inference, offering features like in-flight batching and paged attention for optimized throughput and latency.

How It Works

The backend wraps TensorRT-LLM's optimized kernels and graph optimizations and exposes them through Triton's C++ backend API. It supports in-flight batching, which lets new requests join batches that are already executing instead of waiting for the current batch to drain; paged attention for efficient KV-cache management; and multiple decoding strategies (Top-k, Top-p, beam search, speculative decoding). Together these features raise GPU utilization and reduce memory overhead.
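For illustration, the per-request decoding controls are exposed as input tensors on the served model. The snippet below is a minimal client sketch using the tritonclient HTTP API; the model name ("ensemble") and the tensor names and dtypes (text_input, max_tokens, runtime_top_k, runtime_top_p, beam_width) are assumptions based on the repository's common defaults and should be checked against your model's config.pbtxt.

    # Minimal sketch: setting per-request decoding parameters via Triton's HTTP API.
    # Tensor names and dtypes are assumed defaults; verify against your config.pbtxt.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    def make_input(name, data, dtype):
        tensor = httpclient.InferInput(name, list(data.shape), dtype)
        tensor.set_data_from_numpy(data)
        return tensor

    inputs = [
        make_input("text_input", np.array([["What is TensorRT-LLM?"]], dtype=object), "BYTES"),
        make_input("max_tokens", np.array([[128]], dtype=np.int32), "INT32"),
        make_input("runtime_top_k", np.array([[40]], dtype=np.uint32), "UINT32"),   # Top-k sampling
        make_input("runtime_top_p", np.array([[0.9]], dtype=np.float32), "FP32"),   # Top-p sampling
        make_input("beam_width", np.array([[1]], dtype=np.uint32), "UINT32"),       # >1 enables beam search
    ]

    result = client.infer("ensemble", inputs)
    print(result.as_numpy("text_output"))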

Quick Start & Requirements

  • Install/Run: Launch Triton with the TensorRT-LLM container (nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3); see the readiness-check sketch after this list.
  • Prerequisites: NVIDIA GPU, CUDA >= 12, Python 3.x, Triton Inference Server. TensorRT-LLM engines must be prepared for the specific model.
  • Setup: Requires building TensorRT-LLM engines, which can take significant time depending on the model size and complexity.
  • Docs: Triton Backend Repo, TensorRT-LLM Repo
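Once the container is up, a quick smoke test can confirm that the server and model are reachable. This is a minimal sketch, assuming Triton's default HTTP port (8000) and the default "ensemble" model name; adjust both for your deployment.

    # Readiness probe against a running Triton + TensorRT-LLM container.
    # Port 8000 and the model name "ensemble" are assumed defaults.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    print("server live: ", client.is_server_live())
    print("server ready:", client.is_server_ready())
    print("model ready: ", client.is_model_ready("ensemble"))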

Highlighted Details

  • Supports in-flight batching and paged attention for efficient LLM serving.
  • Offers multiple decoding strategies including Top-k, Top-p, Beam Search, Medusa, ReDrafter, Lookahead, and Eagle.
  • Enables model parallelism (Tensor Parallelism, Pipeline Parallelism, Expert Parallelism) and MIG support.
  • Provides comprehensive benchmarking tools and Triton metrics for performance monitoring (see the metrics sketch after this list).
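To monitor a running deployment, Triton's Prometheus endpoint can be scraped directly. The snippet below is a minimal sketch, assuming Triton's default metrics port (8002); when the TensorRT-LLM backend is loaded, backend-specific statistics (for example KV-cache and in-flight batcher gauges) appear alongside Triton's core counters.

    # Scrape Triton's Prometheus metrics endpoint (default port 8002) and
    # print the metric lines, skipping the "# HELP" / "# TYPE" comments.
    import urllib.request

    with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
        text = resp.read().decode("utf-8")

    for line in text.splitlines():
        if line and not line.startswith("#"):
            print(line)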

Maintenance & Community

  • Developed and maintained by NVIDIA.
  • Community support and questions can be directed to the issues page.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

Preparing TensorRT-LLM engines is complex and time-consuming. Orchestrator mode may need additional configuration to work with Slurm deployments. Reported performance numbers depend heavily on the specific GPU hardware used.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 10
  • Issues (30d): 1
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

  • llm-analysis by cli99: CLI tool for LLM latency/memory analysis during training/inference. 455 stars. Created 2 years ago; updated 5 months ago. Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).
  • TensorRT-LLM by NVIDIA: LLM inference optimization SDK for NVIDIA GPUs. 12k stars. Created 2 years ago; updated 12 hours ago. Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.