Triton backend for serving TensorRT-LLM models
This repository provides the Triton Inference Server backend for TensorRT-LLM, enabling efficient serving of large language models. It targets developers and researchers needing high-performance LLM inference, offering features like in-flight batching and paged attention for optimized throughput and latency.
How It Works
The backend leverages TensorRT-LLM's optimized kernels and graph optimizations for LLM inference. It integrates with Triton's C++ backend API and supports in-flight batching, which lets new requests join and finished requests leave the running batch at each generation step, paged attention for efficient KV cache management, and several decoding strategies (Top-k, Top-p, Beam Search, Speculative Decoding). Together these features improve GPU utilization and reduce memory overhead.
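As a hedged illustration, the sketch below sends one generation request to a served TensorRT-LLM model using Triton's Python HTTP client. The model name (ensemble) and tensor names (text_input, max_tokens, top_k, top_p, text_output) are assumptions based on the example model repositories commonly used with this backend; a given deployment's config.pbtxt may define different names.

    import numpy as np
    import tritonclient.http as httpclient

    # Assumed endpoint; adjust to your deployment.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    def make_input(name, data, dtype):
        # Wrap a numpy array (shaped [batch, 1]) as a Triton input tensor.
        tensor = httpclient.InferInput(name, data.shape, dtype)
        tensor.set_data_from_numpy(data)
        return tensor

    inputs = [
        make_input("text_input",
                   np.array([["What is Triton Inference Server?"]], dtype=object),
                   "BYTES"),
        make_input("max_tokens", np.array([[64]], dtype=np.int32), "INT32"),
        # Sampling parameters map onto the decoding strategies the backend exposes.
        make_input("top_k", np.array([[1]], dtype=np.int32), "INT32"),
        make_input("top_p", np.array([[0.0]], dtype=np.float32), "FP32"),
    ]

    # Assumed model name "ensemble"; the request is batched with the other
    # in-flight requests on the server side.
    result = client.infer(model_name="ensemble", inputs=inputs)
    print(result.as_numpy("text_output"))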
Quick Start & Requirements
The recommended setup uses the prebuilt NGC container (nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3): build a TensorRT-LLM engine for your model, populate a Triton model repository, and launch tritonserver inside the container.
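As a minimal smoke test, assuming the server is already running inside that container with an engine and model repository prepared, the following sketch checks over HTTP that Triton and the model are live (the model name ensemble is again an assumption taken from the backend's example model repository):

    import tritonclient.http as httpclient

    # Assumed defaults: Triton's HTTP endpoint on localhost:8000 and the
    # example model repository's "ensemble" entry point.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    if client.is_server_live() and client.is_server_ready():
        print("Triton is up")
    # Confirm the TensorRT-LLM model loaded its engine successfully.
    print("model ready:", client.is_model_ready("ensemble"))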
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The setup process for preparing TensorRT-LLM engines is complex and time-consuming. Orchestrator mode's compatibility with Slurm deployments may require specific configurations. Performance numbers are highly dependent on the specific GPU hardware used.