Triton backend for serving TensorRT-LLM models
This repository provides the Triton Inference Server backend for TensorRT-LLM, enabling efficient serving of large language models. It targets developers and researchers needing high-performance LLM inference, offering features like in-flight batching and paged attention for optimized throughput and latency.
How It Works
The backend leverages TensorRT-LLM's optimized kernels and graph optimizations for LLM inference. It integrates with Triton's C++ backend API and supports in-flight batching, which lets new requests join and finished requests leave the running batch at each generation step, paged attention for efficient KV cache management, and several decoding strategies (Top-k, Top-p, Beam Search, Speculative Decoding). Together these features improve GPU utilization and reduce memory overhead.
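As a hedged illustration, the sketch below sends one generation request to a served TensorRT-LLM model using Triton's Python HTTP client. The model name (ensemble) and tensor names (text_input, max_tokens, top_k, top_p, text_output) are assumptions based on the example model repositories commonly used with this backend; a given deployment's config.pbtxt may define different names.

    import numpy as np
    import tritonclient.http as httpclient

    # Assumed endpoint; adjust to your deployment.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    def make_input(name, data, dtype):
        # Wrap a numpy array (shaped [batch, 1]) as a Triton input tensor.
        tensor = httpclient.InferInput(name, data.shape, dtype)
        tensor.set_data_from_numpy(data)
        return tensor

    inputs = [
        make_input("text_input",
                   np.array([["What is Triton Inference Server?"]], dtype=object),
                   "BYTES"),
        make_input("max_tokens", np.array([[64]], dtype=np.int32), "INT32"),
        # Sampling parameters map onto the decoding strategies the backend exposes.
        make_input("top_k", np.array([[1]], dtype=np.int32), "INT32"),
        make_input("top_p", np.array([[0.0]], dtype=np.float32), "FP32"),
    ]

    # Assumed model name "ensemble"; the request is batched with the other
    # in-flight requests on the server side.
    result = client.infer(model_name="ensemble", inputs=inputs)
    print(result.as_numpy("text_output"))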
Quick Start & Requirements
The recommended setup uses the prebuilt NGC container (nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3): build a TensorRT-LLM engine for your model, populate a Triton model repository, and launch tritonserver inside the container.
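As a minimal smoke test, assuming the server is already running inside that container with an engine and model repository prepared, the following sketch checks over HTTP that Triton and the model are live (the model name ensemble is again an assumption taken from the backend's example model repository):

    import tritonclient.http as httpclient

    # Assumed defaults: Triton's HTTP endpoint on localhost:8000 and the
    # example model repository's "ensemble" entry point.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    if client.is_server_live() and client.is_server_ready():
        print("Triton is up")
    # Confirm the TensorRT-LLM model loaded its engine successfully.
    print("model ready:", client.is_model_ready("ensemble"))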
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The setup process for preparing TensorRT-LLM engines is complex and time-consuming. Orchestrator mode's compatibility with Slurm deployments may require specific configurations. Performance numbers are highly dependent on the specific GPU hardware used.