TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs

Created 2 years ago
12,594 stars

Top 4.0% on SourcePulse

View on GitHub
Project Summary

NVIDIA TensorRT-LLM is an open-source library designed to optimize Large Language Model (LLM) inference on NVIDIA GPUs. It provides a Python API for defining LLMs and incorporates advanced optimizations like custom attention kernels, in-flight batching, paged KV caching, and various quantization techniques (FP8, FP4, INT4, INT8). The library targets researchers and developers seeking to maximize inference performance, reduce latency, and lower costs for LLM deployments.
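As a minimal sketch of that Python API, following the LLM API pattern from the project's quick-start documentation (the TinyLlama checkpoint here is an arbitrary example, not a requirement):

```python
# Minimal sketch of the high-level LLM API, assuming TensorRT-LLM is installed.
# The model name is an illustrative Hugging Face checkpoint; any supported model works.
from tensorrt_llm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Downloads/loads the model and prepares it for optimized inference.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# In-flight batching handles the batched prompts under the hood.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```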

How It Works

TensorRT-LLM offers two backends: a PyTorch backend for flexible development and rapid iteration, and a traditional TensorRT backend for ahead-of-time compilation into highly optimized "Engines." This dual-backend approach allows users to leverage the ease of PyTorch for experimentation while still achieving peak performance for deployment. The library includes a unified LLM API to simplify model setup and inference across both backends, supporting distributed inference via Tensor and Pipeline Parallelism. It also integrates with the NVIDIA Triton Inference Server for production deployments.
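For distributed inference, the same unified API takes a parallelism argument; a hedged sketch, assuming the documented `tensor_parallel_size` parameter of the LLM API (the checkpoint name is illustrative):

```python
# Sketch: sharding a model across 2 GPUs with tensor parallelism.
# `tensor_parallel_size` is the LLM API's TP setting; pipeline parallelism is
# configured analogously (exact parameter names can vary by release).
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=2,                    # split weights across 2 GPUs
)

print(llm.generate(["Distributed inference test: "])[0].outputs[0].text)
```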

Quick Start & Requirements

  • Installation: Typically via pip or by building from source; specific installation guides are available for Linux and Grace Hopper (see the smoke-test sketch after this list).
  • Prerequisites: NVIDIA GPU, CUDA (e.g., 12.8.1), TensorRT (e.g., 10.9.0), Python (3.10 or 3.12). A support matrix for hardware, models, and software is provided.
  • Resources: GPU memory requirements depend on the LLM size.
  • Links: Quick Start Guide, Installation Guide, Supported Hardware.
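A minimal post-install smoke test, assuming the wheel installed cleanly (consult the Installation Guide for the current pip command and index URL):

```python
# Verify the installation and CUDA visibility before building engines or serving.
import torch          # the PyTorch backend requires torch
import tensorrt_llm

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
```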

Highlighted Details

  • Achieves state-of-the-art performance, e.g., over 40,000 tokens/sec on B200 GPUs for Llama 4.
  • Supports advanced optimizations such as speculative decoding, INT4 AWQ, and FP4 quantization (see the quantization sketch after this list).
  • Offers a backend for the Triton Inference Server for production deployment.
  • Provides pre-defined models and the ability to extend them using native PyTorch or the Python API.
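As an illustration of enabling one of these quantization modes through the LLM API: a sketch assuming the `QuantConfig`/`QuantAlgo` interfaces from the library's quantization documentation (treat the exact import paths and enum spellings as assumptions to verify against your release):

```python
# Sketch: requesting INT4 AWQ weight-only quantization via the LLM API.
# QuantConfig/QuantAlgo follow the documented llmapi interfaces, but import
# paths and enum names are assumptions that may differ between releases.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    quant_config=quant_config,                 # quantize weights at load/build time
)
```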

Maintenance & Community

The project is actively developed by NVIDIA, with frequent updates and new features. It is fully open source, and development now happens on GitHub. Links to documentation, examples, and a roadmap are provided.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source software.

Limitations & Caveats

While highly optimized for NVIDIA hardware, TensorRT-LLM's performance benefits are tied to specific NVIDIA GPU architectures and CUDA/TensorRT versions. Some advanced features or specific model optimizations might be experimental or require specific hardware configurations.

Health Check

  • Last Commit: 11 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 614
  • Issues (30d): 257
  • Star History: 246 stars in the last 30 days

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

Explore Similar Projects

tensorrtllm_backend by triton-inference-server

Top 0.1% on SourcePulse
912 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

Top 0.7% on SourcePulse
995 stars
LLM inference engine for diverse applications
Created 2 years ago
Updated 9 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

Top 0.2% on SourcePulse
29k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago
Updated 6 months ago