TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs

Created 2 years ago
11,613 stars

Top 4.4% on SourcePulse

View on GitHub
Project Summary

NVIDIA TensorRT-LLM is an open-source library designed to optimize Large Language Model (LLM) inference on NVIDIA GPUs. It provides a Python API for defining LLMs and incorporates advanced optimizations like custom attention kernels, in-flight batching, paged KV caching, and various quantization techniques (FP8, FP4, INT4, INT8). The library targets researchers and developers seeking to maximize inference performance, reduce latency, and lower costs for LLM deployments.
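
To give a feel for that Python API, here is a minimal generation sketch following the project's documented LLM API quick start (the checkpoint name and sampling values are illustrative; any supported Hugging Face model should work):

```python
from tensorrt_llm import LLM, SamplingParams

# Point the LLM API at a supported checkpoint; backend setup happens
# behind this single call. The model name here is illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Standard sampling knobs; values are illustrative.
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```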

How It Works

TensorRT-LLM offers two backends: a PyTorch backend for flexible development and rapid iteration, and a traditional TensorRT backend for ahead-of-time compilation into highly optimized "Engines." This dual-backend approach allows users to leverage the ease of PyTorch for experimentation while still achieving peak performance for deployment. The library includes a unified LLM API to simplify model setup and inference across both backends, supporting distributed inference via Tensor and Pipeline Parallelism. It also integrates with the NVIDIA Triton Inference Server for production deployments.
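
The same entry point carries the distributed settings. A hedged sketch of tensor-parallel setup, assuming two visible GPUs (`tensor_parallel_size` is the documented LLM API knob; the checkpoint name is illustrative):

```python
from tensorrt_llm import LLM

# Shard the model's weight matrices across 2 GPUs (tensor parallelism).
# pipeline_parallel_size can be set analogously for pipeline parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=2,                    # requires 2 visible CUDA devices
)
```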

Quick Start & Requirements

  • Installation: Typically via pip or building from source; specific installation guides are available for Linux and Grace Hopper (a post-install sanity check follows this list).
  • Prerequisites: An NVIDIA GPU, CUDA (e.g., 12.8.1), TensorRT (e.g., 10.9.0), and Python (3.10 or 3.12). A support matrix covering hardware, models, and software is provided.
  • Resources: GPU memory requirements depend on the LLM size.
  • Links: Quick Start Guide, Installation Guide, Supported Hardware.
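
A minimal post-install sanity check (assumes a CUDA-capable NVIDIA GPU and a pip install of the `tensorrt_llm` package):

```python
# Confirms the package imports cleanly and reports the installed version.
import tensorrt_llm

print(tensorrt_llm.__version__)
```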

Highlighted Details

  • Achieves state-of-the-art performance, e.g., over 40,000 tokens/sec on B200 GPUs for Llama 4.
  • Supports advanced optimizations like speculative decoding, INT4 AWQ, and FP4 quantization (see the sketch after this list).
  • Offers a backend for Triton Inference Server for production deployment.
  • Provides pre-defined models and the ability to extend them using native PyTorch or a Python API.
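
As an illustration of how these options surface in the LLM API, a sketch of requesting INT4 AWQ quantization (import paths and enum names follow the LLM API examples at the time of writing and may differ across versions):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Request INT4 AWQ weight-only quantization for the loaded checkpoint.
# Other algorithms (e.g., FP8) are selected the same way via QuantAlgo.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ),
)
```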

Maintenance & Community

The project is actively developed by NVIDIA, with frequent updates and new features. It is fully open source, and development now happens in the open on GitHub. Links to documentation, examples, and a roadmap are provided.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source software.

Limitations & Caveats

While highly optimized for NVIDIA hardware, TensorRT-LLM's performance benefits are tied to specific NVIDIA GPU architectures and CUDA/TensorRT versions. Some advanced features or specific model optimizations might be experimental or require specific hardware configurations.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 804
  • Issues (30d): 141
  • Star History: 271 stars in the last 30 days

Explore Similar Projects

tensorrtllm_backend by triton-inference-server
Triton backend for serving TensorRT-LLM models
  • Top 0.2% on SourcePulse · 889 stars
  • Created 2 years ago · Updated 1 day ago
  • Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

llm.c by karpathy
LLM training in pure C/CUDA, no PyTorch needed
  • Top 0.2% on SourcePulse · 28k stars
  • Created 1 year ago · Updated 2 months ago
  • Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.