Library for optimizing deep learning models for GPU inference
NVIDIA TensorRT Model Optimizer is a Python library that compresses and accelerates deep learning models for efficient inference on NVIDIA GPUs. It targets researchers and engineers working with large generative AI models, offering techniques such as quantization, pruning, and speculative decoding to reduce model size and improve inference speed, and it integrates with frameworks like TensorRT-LLM and TensorRT for deployment.
How It Works
The library provides Python APIs to apply various state-of-the-art optimization techniques to PyTorch or ONNX models. It supports advanced quantization algorithms (SmoothQuant, AWQ, SVDQuant, NVFP4, FP8, INT8, INT4) for both post-training quantization (PTQ) and quantization-aware training (QAT). Additionally, it includes implementations for pruning (weights, attention heads, MLP, embedding dimensions, depth), knowledge distillation, speculative decoding (Medusa, EAGLE), and sparsity patterns (NVIDIA 2:4, ASP, SparseGPT). The output is an optimized checkpoint ready for deployment.
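As an illustration, the sketch below shows what post-training quantization could look like with the modelopt.torch.quantization API. The toy model, the synthetic calibration data, and the choice of the INT8_SMOOTHQUANT_CFG preset are assumptions made for demonstration, not a prescribed workflow; consult the library's documentation for the supported configurations.

    import torch
    import torch.nn as nn
    import modelopt.torch.quantization as mtq

    # Toy model standing in for a real network to be quantized.
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
    calib_data = [torch.randn(4, 64) for _ in range(8)]

    def forward_loop(m):
        # Calibration pass: run representative data through the model so
        # activation ranges can be collected for the chosen format.
        with torch.no_grad():
            for batch in calib_data:
                m(batch)

    # Post-training quantization: insert quantizers and calibrate in place.
    # INT8_SMOOTHQUANT_CFG is one predefined preset; FP8 or INT4-AWQ presets
    # can be substituted depending on the deployment target.
    model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

From there, the quantized model can be further fine-tuned (QAT) or exported as an optimized checkpoint for deployment.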
Quick Start & Requirements
Docker images are provided for a ready-to-run environment (build with ./docker/build.sh, then docker run ...). Alternatively, install from PyPI with pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com, or install from source with pip install -e ".[all]" --extra-index-url https://pypi.nvidia.com.
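A quick way to confirm the installation resolved correctly is a simple import check; this assumes the package exposes a __version__ attribute, as most Python packages do.

    # Verify that the package and its quantization module import cleanly.
    import modelopt
    import modelopt.torch.quantization as mtq
    print(modelopt.__version__)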
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The library is primarily focused on NVIDIA GPU acceleration, and full functionality, especially for deployment examples, is best achieved using the provided Docker images. Some features, like vLLM deployment, are noted as experimental.