TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs

created 1 year ago
11,177 stars

Top 4.6% on sourcepulse

View on GitHub
Project Summary

NVIDIA TensorRT-LLM is an open-source library designed to optimize Large Language Model (LLM) inference on NVIDIA GPUs. It provides a Python API for defining LLMs and incorporates advanced optimizations like custom attention kernels, in-flight batching, paged KV caching, and various quantization techniques (FP8, FP4, INT4, INT8). The library targets researchers and developers seeking to maximize inference performance, reduce latency, and lower costs for LLM deployments.

How It Works

TensorRT-LLM offers two backends: a PyTorch backend for flexible development and rapid iteration, and a traditional TensorRT backend for ahead-of-time compilation into highly optimized "Engines." This dual-backend approach allows users to leverage the ease of PyTorch for experimentation while still achieving peak performance for deployment. The library includes a unified LLM API to simplify model setup and inference across both backends, supporting distributed inference via Tensor and Pipeline Parallelism. It also integrates with the NVIDIA Triton Inference Server for production deployments.
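
As a rough illustration of the unified LLM API, the minimal sketch below loads a Hugging Face checkpoint and runs it with tensor parallelism across two GPUs; the model name and parallelism degree are placeholder choices, and argument names may shift between releases.

```python
# Minimal sketch of the unified LLM API with tensor parallelism.
# The checkpoint and tensor_parallel_size are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported Hugging Face model
    tensor_parallel_size=2,                    # shard weights across two GPUs
)

params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Explain in-flight batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same high-level call is intended to work regardless of which backend executes the model underneath.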

Quick Start & Requirements

  • Installation: Typically installed via pip or built from source; dedicated installation guides cover Linux and Grace Hopper. A first-run sketch follows this list.
  • Prerequisites: NVIDIA GPU, CUDA (e.g., 12.8.1), TensorRT (e.g., 10.9.0), and Python (3.10 or 3.12). A support matrix for hardware, models, and software is provided.
  • Resources: GPU memory requirements depend on the LLM size.
  • Links: Quick Start Guide, Installation Guide, Supported Hardware.
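
A first run, in the spirit of the quick start guide, might look like the sketch below (assuming TensorRT-LLM is already installed; the TinyLlama checkpoint is just a small illustrative choice):

```python
# Hedged quick-start sketch: batch generation with the LLM API.
# Assumes tensorrt_llm is installed; the model is an illustrative small checkpoint.
from tensorrt_llm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result pairs the original prompt with its generated continuation.
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```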

Highlighted Details

  • Achieves state-of-the-art performance, e.g., over 40,000 tokens/sec on B200 GPUs for Llama 4.
  • Supports advanced optimizations such as speculative decoding, INT4 AWQ, and FP4 quantization (see the sketch after this list).
  • Offers a backend for Triton Inference Server for production deployment.
  • Provides pre-defined models and the ability to extend them using native PyTorch or a Python API.
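
As a sketch of how one of these quantization recipes might be selected through the LLM API (the QuantConfig/QuantAlgo helper names and the W4A16_AWQ value are assumptions about the llmapi surface and may differ by release):

```python
# Hedged sketch: requesting an INT4 AWQ quantization recipe via the LLM API.
# QuantConfig/QuantAlgo and the W4A16_AWQ value are assumptions about the
# llmapi helpers; exact names may vary between TensorRT-LLM releases.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)  # INT4 weights, FP16 activations

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quant_config=quant_config,
)
print(llm.generate(["Summarize AWQ in one line."])[0].outputs[0].text)
```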

Maintenance & Community

The project is actively developed by NVIDIA, with frequent updates and new features. It is fully open source, with development now hosted on GitHub. Links to documentation, examples, and a roadmap are provided.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source software.

Limitations & Caveats

While highly optimized for NVIDIA hardware, TensorRT-LLM's performance benefits are tied to specific NVIDIA GPU architectures and CUDA/TensorRT versions. Some advanced features or specific model optimizations might be experimental or require specific hardware configurations.

Health Check

  • Last commit: 14 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 728
  • Issues (30d): 132

Star History

863 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6%
5k
Triton kernels for efficient LLM training
created 1 year ago
updated 1 day ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4%
84k
C/C++ library for local LLM inference
created 2 years ago
updated 10 hours ago