TensorRT-Model-Optimizer by NVIDIA

Library for optimizing deep learning models for GPU inference

created 1 year ago
1,082 stars

Top 35.7% on sourcepulse

View on GitHub
Project Summary

NVIDIA TensorRT Model Optimizer is a Python library for compressing and accelerating deep learning models so they run efficiently on NVIDIA GPUs. Aimed at researchers and engineers working with large generative AI models, it offers techniques such as quantization, pruning, and speculative decoding to reduce model size and improve inference speed, and it integrates with downstream frameworks such as TensorRT-LLM and TensorRT.

How It Works

The library provides Python APIs that apply state-of-the-art optimization techniques to PyTorch or ONNX models. It supports advanced quantization formats (NVFP4, FP8, INT8, INT4) and calibration algorithms (SmoothQuant, AWQ, SVDQuant) for both post-training quantization (PTQ) and quantization-aware training (QAT). It also includes pruning (weights, attention heads, MLP, embedding dimensions, depth), knowledge distillation, speculative decoding (Medusa, EAGLE), and sparsity (NVIDIA 2:4 pattern via ASP and SparseGPT). The output is an optimized checkpoint ready for deployment.
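
For illustration, here is a minimal post-training quantization sketch. It assumes the library's documented modelopt.torch.quantization API (mtq.quantize with a predefined config such as mtq.FP8_DEFAULT_CFG and a user-supplied calibration loop); the Hugging Face model id and calibration texts are placeholders, not part of the project's documentation.

```python
# Minimal PTQ sketch. Assumptions: mtq.quantize and mtq.FP8_DEFAULT_CFG exist as
# documented; the model id and calibration texts below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["Hello world.", "A short calibration sample."]  # tiny stand-in set

def forward_loop(m):
    # Run a few representative batches so activation ranges can be calibrated.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Apply FP8 post-training quantization in place; other predefined configs
# (e.g. INT8 SmoothQuant, INT4 AWQ) follow the same pattern.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The quantized model can then be exported with the library's export utilities for deployment with TensorRT-LLM; see the project documentation for the exact entry points.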

Quick Start & Requirements

  • Installation: Recommended via the Docker image (./docker/build.sh, then docker run ...). Alternatively, pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com, or install from source with pip install -e ".[all]" --extra-index-url https://pypi.nvidia.com. A post-install sanity check follows this list.
  • Prerequisites: NVIDIA Container Toolkit for Docker. Python environment.
  • Resources: Docker build may take time. Inference acceleration benefits are hardware-dependent.
  • Links: Documentation, Examples, Support Matrix
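
To confirm the install resolved correctly against the NVIDIA package index, a quick import check (referenced from the installation bullet above) is usually enough. The sketch below assumes only that the nvidia-modelopt wheel exposes the modelopt namespace with its torch quantization and optimizer-state modules.

```python
# Post-install sanity check: verify that the package and its main torch modules import.
# Assumes the nvidia-modelopt wheel exposes the `modelopt` namespace; the __version__
# attribute is read defensively in case a given release does not define it.
import modelopt
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

print("modelopt version:", getattr(modelopt, "__version__", "unknown"))
print("quantization module:", mtq.__name__)
print("optimizer-state module:", mto.__name__)
```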

Highlighted Details

  • Supports advanced quantization formats like NVFP4, FP8, INT8, INT4 with algorithms like SmoothQuant, AWQ, SVDQuant.
  • Integrates with NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization.
  • Offers speculative decoding algorithms (Medusa, EAGLE) for faster token generation.
  • Quantized checkpoints are available on Hugging Face for direct deployment with TensorRT-LLM and vLLM; see the deployment sketch after this list.
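
As a rough illustration of the last bullet, the sketch below loads a pre-quantized checkpoint with vLLM's offline LLM entrypoint. The checkpoint id is a placeholder for one of NVIDIA's published quantized models, and whether a given vLLM build handles a modelopt-quantized config automatically depends on the installed version; treat this as a sketch, not canonical deployment instructions.

```python
# Sketch: offline inference on a pre-quantized checkpoint with vLLM.
# The model id below is a placeholder; substitute one of NVIDIA's published
# quantized checkpoints from Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")  # placeholder model id
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Explain what post-training quantization does."], params)
print(outputs[0].outputs[0].text)
```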

Maintenance & Community

  • Open-sourced in January 2025.
  • Active development with frequent updates and new model support (e.g., Llama 3.1, DeepSeek-R1).
  • Contributions are welcomed.
  • Roadmap available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is primarily focused on NVIDIA GPU acceleration, and full functionality, especially for deployment examples, is best achieved using the provided Docker images. Some features, like vLLM deployment, are noted as experimental.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 13

Star History

195 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago, updated 20 hours ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai
0.2% · 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago