tensorrtllm_backend by triton-inference-server

Triton backend for serving TensorRT-LLM models

created 1 year ago
871 stars

Top 42.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides the Triton Inference Server backend for TensorRT-LLM, enabling efficient serving of large language models. It targets developers and researchers who need high-performance LLM inference, offering features such as in-flight batching and paged attention for higher throughput and lower latency.

How It Works

The backend leverages TensorRT-LLM's optimized kernels and graph optimizations for LLM inference. It integrates with Triton's C++ backend API and supports in-flight batching (new requests join a running batch instead of waiting for it to drain), paged attention for efficient KV-cache management, and multiple decoding strategies (Top-k, Top-p, Beam Search, Speculative Decoding). Together, these features improve GPU utilization and reduce memory overhead.
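
As a rough illustration of how a client exercises these decoding options, here is a minimal Python sketch using the tritonclient package. The model name ("ensemble") and tensor names (text_input, max_tokens, top_k, top_p, text_output) are assumptions taken from the example model repository shipped with this repo and may differ in your deployment.

    import numpy as np
    import tritonclient.http as httpclient

    def tensor(name, array, dtype):
        # Wrap a numpy array as a Triton input tensor.
        t = httpclient.InferInput(name, list(array.shape), dtype)
        t.set_data_from_numpy(array)
        return t

    # 8000 is Triton's default HTTP port.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    inputs = [
        # Tensor names below are assumptions based on the repo's example ensemble.
        tensor("text_input", np.array([["What is in-flight batching?"]], dtype=object), "BYTES"),
        tensor("max_tokens", np.array([[64]], dtype=np.int32), "INT32"),
        # Sampling parameters select the decoding strategy (Top-k / Top-p here).
        tensor("top_k", np.array([[40]], dtype=np.int32), "INT32"),
        tensor("top_p", np.array([[0.9]], dtype=np.float32), "FP32"),
    ]

    result = client.infer("ensemble", inputs)
    print(result.as_numpy("text_output"))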

Quick Start & Requirements

  • Install/Run: Launch Triton with the TensorRT-LLM container (nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3); a readiness-check sketch follows this list.
  • Prerequisites: NVIDIA GPU, CUDA >= 12, Python 3.x, Triton Inference Server. TensorRT-LLM engines must be prepared for the specific model.
  • Setup: Requires building TensorRT-LLM engines, which can take significant time depending on the model size and complexity.
  • Docs: Triton Backend Repo, TensorRT-LLM Repo
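
Once the container is up, a minimal readiness-check sketch in Python, assuming the default HTTP port (8000) and the model names used in the repo's example model repository (both are assumptions and will vary with your setup):

    import tritonclient.http as httpclient

    # Connect to the Triton HTTP endpoint exposed by the launched container
    # (8000 is Triton's default HTTP port).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    if not client.is_server_ready():
        raise RuntimeError("Triton is not ready; check the container logs")

    # Model names below follow the repo's example model repository layout
    # and are assumptions; substitute the names in your own repository.
    for model in ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble"):
        print(model, "ready:", client.is_model_ready(model))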

Highlighted Details

  • Supports in-flight batching and paged attention for efficient LLM serving.
  • Offers multiple decoding strategies including Top-k, Top-p, Beam Search, Medusa, ReDrafter, Lookahead, and Eagle.
  • Enables model parallelism (Tensor Parallelism, Pipeline Parallelism, Expert Parallelism) and supports NVIDIA MIG (Multi-Instance GPU).
  • Provides benchmarking tools and Triton metrics for performance monitoring (see the metrics sketch after this list).
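
For the metrics mentioned above, a small sketch that scrapes Triton's Prometheus endpoint (port 8002 by default) and prints the inference counters; exact metric names vary by Triton version, so treat the filter prefix as an assumption:

    from urllib.request import urlopen

    # Triton serves Prometheus-format metrics on port 8002 by default.
    text = urlopen("http://localhost:8002/metrics").read().decode("utf-8")

    # Keep only inference-related series (e.g. nv_inference_request_success,
    # nv_inference_count); other nv_* series cover GPU utilization and memory.
    for line in text.splitlines():
        if line.startswith("nv_inference"):
            print(line)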

Maintenance & Community

  • Developed and maintained by NVIDIA.
  • Questions and support requests can be directed to the GitHub issues page.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The setup process for preparing TensorRT-LLM engines is complex and time-consuming. Orchestrator mode's compatibility with Slurm deployments may require specific configurations. Performance numbers are highly dependent on the specific GPU hardware used.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 6
  • Issues (30d): 0

Star History

  • 43 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

S-LoRA by S-LoRA
0.1%
2k stars
System for scalable LoRA adapter serving
created 1 year ago, updated 1 year ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin
0.6%
5k stars
Triton kernels for efficient LLM training
created 1 year ago, updated 1 day ago

Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
0.2%
6k stars
Optimized transformer library for inference
created 4 years ago, updated 1 year ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
0.6%
11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago, updated 17 hours ago