TensorRT-Model-Optimizer by NVIDIA

Library for optimizing deep learning models for GPU inference

Created 1 year ago
1,371 stars

Top 29.4% on SourcePulse

View on GitHub
Project Summary

NVIDIA TensorRT Model Optimizer is a Python library for compressing and accelerating deep learning models for efficient inference on NVIDIA GPUs. It targets researchers and engineers working with large generative AI models, offering techniques such as quantization, pruning, and speculative decoding to reduce model size and improve inference speed, and it integrates with frameworks like TensorRT-LLM and TensorRT.

How It Works

The library provides Python APIs to apply various state-of-the-art optimization techniques to PyTorch or ONNX models. It supports advanced quantization formats (NVFP4, FP8, INT8, INT4) and calibration algorithms (SmoothQuant, AWQ, SVDQuant) for both post-training quantization (PTQ) and quantization-aware training (QAT). It also includes implementations for pruning (weights, attention heads, MLP, embedding dimensions, depth), knowledge distillation, speculative decoding (Medusa, EAGLE), and sparsity patterns (NVIDIA 2:4, ASP, SparseGPT). The output is an optimized checkpoint ready for deployment.
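
A minimal PTQ sketch, assuming the modelopt.torch.quantization API described in the project documentation; the toy model, calibration batches, and the FP8_DEFAULT_CFG config name are illustrative placeholders and may need adjusting to your installed version:

    import torch
    import torch.nn as nn
    import modelopt.torch.quantization as mtq

    # Toy model and calibration batches stand in for a real network and dataset.
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
    calib_batches = [torch.randn(4, 64).cuda() for _ in range(8)]

    # Pick a predefined quantization config (e.g. FP8, INT8 SmoothQuant, INT4 AWQ).
    config = mtq.FP8_DEFAULT_CFG

    # The forward loop feeds calibration data through the model so activation
    # ranges can be collected during PTQ.
    def forward_loop(m):
        for batch in calib_batches:
            m(batch)

    # Replace supported layers with quantized equivalents and calibrate.
    model = mtq.quantize(model, config, forward_loop)

    # For QAT, keep fine-tuning the returned model with a normal PyTorch
    # training loop before exporting the checkpoint.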

Quick Start & Requirements

  • Installation: Recommended via Docker image (./docker/build.sh, docker run ...). Alternatively, pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com. Install from source with pip install -e ".[all]" --extra-index-url https://pypi.nvidia.com. A quick import check is sketched after this list.
  • Prerequisites: NVIDIA Container Toolkit for Docker. Python environment.
  • Resources: Docker build may take time. Inference acceleration benefits are hardware-dependent.
  • Links: Documentation, Examples, Support Matrix
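
A short sanity check after installing, assuming the package exposes a standard __version__ attribute and the modelopt.torch submodules named in the documentation:

    # Verify the installation from Python; run inside the Docker container
    # or the environment where nvidia-modelopt was pip-installed.
    import modelopt
    import modelopt.torch.quantization as mtq  # main quantization entry point

    print("ModelOpt version:", modelopt.__version__)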

Highlighted Details

  • Supports advanced quantization formats like NVFP4, FP8, INT8, INT4 with algorithms like SmoothQuant, AWQ, SVDQuant.
  • Integrates with NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization.
  • Offers speculative decoding algorithms (Medusa, EAGLE) for faster token generation.
  • Quantized checkpoints are available on Hugging Face for direct deployment with TensorRT-LLM and vLLM.
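
As an illustration of the last point, a published ModelOpt-quantized checkpoint can be loaded directly by an inference engine. The sketch below uses vLLM; the checkpoint name is an example placeholder, and vLLM deployment is noted as experimental (see Limitations below):

    from vllm import LLM, SamplingParams

    # Example checkpoint name; substitute a ModelOpt-quantized model actually
    # published on Hugging Face.
    llm = LLM(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        quantization="modelopt",  # use the FP8 scheme baked into the checkpoint
    )

    outputs = llm.generate(
        ["Explain what post-training quantization does."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)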

Maintenance & Community

  • Open-sourced in January 2025.
  • Active development with frequent updates and new model support (e.g., Llama 3.1, DeepSeek-R1).
  • Contributions are welcomed.
  • Roadmap available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is primarily focused on NVIDIA GPU acceleration, and full functionality, especially for deployment examples, is best achieved using the provided Docker images. Some features, like vLLM deployment, are noted as experimental.

Health Check

  • Last Commit: 20 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 58
  • Issues (30d): 23
  • Star History: 246 stars in the last 30 days

Explore Similar Projects


  • smoothquant by mit-han-lab: Post-training quantization research paper for large language models. 0.3% · 2k stars · Created 2 years ago · Updated 1 year ago

  • neural-compressor by intel: Python library for model compression (quantization, pruning, distillation, NAS). 0.2% · 2k stars · Created 5 years ago · Updated 17 hours ago