LLM inference optimization SDK for NVIDIA GPUs
NVIDIA TensorRT-LLM is an open-source library designed to optimize Large Language Model (LLM) inference on NVIDIA GPUs. It provides a Python API for defining LLMs and incorporates advanced optimizations like custom attention kernels, in-flight batching, paged KV caching, and various quantization techniques (FP8, FP4, INT4, INT8). The library targets researchers and developers seeking to maximize inference performance, reduce latency, and lower costs for LLM deployments.
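As an illustration of how one of these quantization modes is selected, the sketch below loads a Hugging Face checkpoint with FP8 quantization through the Python LLM API. The QuantConfig and QuantAlgo names follow recent releases of the llmapi module and the model name is only a placeholder; treat this as an assumption to verify against the installed version, not a definitive recipe.

```python
# Hedged sketch: enabling FP8 quantization through the LLM API.
# Class and keyword names (QuantConfig, QuantAlgo, quant_config) follow
# recent tensorrt_llm releases and may differ between versions.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo


def build_fp8_llm(model_id: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0") -> LLM:
    # Request FP8 quantization; other schemes (INT4, INT8) are chosen
    # the same way by picking a different QuantAlgo value.
    quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)
    return LLM(model=model_id, quant_config=quant_config)


if __name__ == "__main__":
    llm = build_fp8_llm()
    print(llm.generate(["Quantized inference test:"])[0].outputs[0].text)
```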
How It Works
TensorRT-LLM offers two backends: a PyTorch backend for flexible development and rapid iteration, and a traditional TensorRT backend for ahead-of-time compilation into highly optimized "Engines." This dual-backend approach allows users to leverage the ease of PyTorch for experimentation while still achieving peak performance for deployment. The library includes a unified LLM API to simplify model setup and inference across both backends, supporting distributed inference via Tensor and Pipeline Parallelism. It also integrates with the NVIDIA Triton Inference Server for production deployments.
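For example, distributed inference with the unified LLM API amounts to passing parallelism sizes at construction time. The following sketch assumes two visible GPUs and a supported Hugging Face checkpoint; the tensor_parallel_size and pipeline_parallel_size keyword names follow the public LLM API but should be checked against the installed release.

```python
# Hedged sketch: multi-GPU inference with the unified LLM API.
# Assumes two visible GPUs; keyword names may vary between releases.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported HF checkpoint
    tensor_parallel_size=2,    # shard weights across two GPUs
    pipeline_parallel_size=1,  # no pipeline stages in this example
)

params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Explain paged KV caching in one paragraph."], params):
    print(out.outputs[0].text)
```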
Quick Start & Requirements
TensorRT-LLM can be installed via pip or built from source. Specific installation guides are available for Linux and Grace Hopper.
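A minimal quick-start sketch, assuming a CUDA-capable NVIDIA GPU and a tensorrt_llm wheel installed from NVIDIA's PyPI index (the exact install command and default backend depend on platform and release):

```python
# Hedged quick-start sketch. Assumes tensorrt_llm was installed first, e.g.:
#   pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com
# (exact commands vary by platform and release; see the install guides).
from tensorrt_llm import LLM, SamplingParams


def main() -> None:
    # Load a Hugging Face checkpoint directly; recent releases use the
    # PyTorch backend by default.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    for output in llm.generate(["The capital of France is"], params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```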
Highlighted Details
Maintenance & Community
The project is actively developed by NVIDIA, with frequent updates and new features. It is fully open source, with development now hosted on GitHub. Links to documentation, examples, and a roadmap are provided.
Licensing & Compatibility
The project is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source software.
Limitations & Caveats
While highly optimized for NVIDIA hardware, TensorRT-LLM's performance benefits are tied to specific NVIDIA GPU architectures and CUDA/TensorRT versions. Some advanced features or specific model optimizations might be experimental or require specific hardware configurations.