TensorRT-Model-Optimizer by NVIDIA

Library for optimizing deep learning models for GPU inference

Created 1 year ago
1,371 stars

Top 29.4% on SourcePulse

View on GitHub
Project Summary

NVIDIA TensorRT Model Optimizer is a Python library for compressing and accelerating deep learning models for efficient inference on NVIDIA GPUs. It targets researchers and engineers working with large generative AI models, offering techniques such as quantization, pruning, and speculative decoding to reduce model size and improve inference speed, and it integrates with frameworks like TensorRT-LLM and TensorRT.

How It Works

The library provides Python APIs to apply various state-of-the-art optimization techniques to PyTorch or ONNX models. It supports advanced quantization formats (NVFP4, FP8, INT8, INT4) and calibration algorithms (SmoothQuant, AWQ, SVDQuant) for both post-training quantization (PTQ) and quantization-aware training (QAT). It also includes implementations for pruning (weights, attention heads, MLP, embedding dimensions, depth), knowledge distillation, speculative decoding (Medusa, EAGLE), and sparsity patterns (NVIDIA 2:4, ASP, SparseGPT). The output is an optimized checkpoint ready for deployment.
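
A minimal PTQ sketch, assuming the modelopt.torch.quantization API described in the project documentation; the toy model, calibration batches, and the FP8_DEFAULT_CFG config name are illustrative placeholders and may need adjusting to your installed version:

    import torch
    import torch.nn as nn
    import modelopt.torch.quantization as mtq

    # Toy model and calibration batches stand in for a real network and dataset.
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
    calib_batches = [torch.randn(4, 64).cuda() for _ in range(8)]

    # Pick a predefined quantization config (e.g. FP8, INT8 SmoothQuant, INT4 AWQ).
    config = mtq.FP8_DEFAULT_CFG

    # The forward loop feeds calibration data through the model so activation
    # ranges can be collected during PTQ.
    def forward_loop(m):
        for batch in calib_batches:
            m(batch)

    # Replace supported layers with quantized equivalents and calibrate.
    model = mtq.quantize(model, config, forward_loop)

    # For QAT, keep fine-tuning the returned model with a normal PyTorch
    # training loop before exporting the checkpoint.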

Quick Start & Requirements

  • Installation: Recommended via Docker image (./docker/build.sh, docker run ...). Alternatively, pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com. Install from source with pip install -e ".[all]" --extra-index-url https://pypi.nvidia.com. A quick import check is sketched after this list.
  • Prerequisites: NVIDIA Container Toolkit for Docker. Python environment.
  • Resources: Docker build may take time. Inference acceleration benefits are hardware-dependent.
  • Links: Documentation, Examples, Support Matrix
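
A short sanity check after installing, assuming the package exposes a standard __version__ attribute and the modelopt.torch submodules named in the documentation:

    # Verify the installation from Python; run inside the Docker container
    # or the environment where nvidia-modelopt was pip-installed.
    import modelopt
    import modelopt.torch.quantization as mtq  # main quantization entry point

    print("ModelOpt version:", modelopt.__version__)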

Highlighted Details

  • Supports advanced quantization formats like NVFP4, FP8, INT8, INT4 with algorithms like SmoothQuant, AWQ, SVDQuant.
  • Integrates with NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization.
  • Offers speculative decoding algorithms (Medusa, EAGLE) for faster token generation.
  • Quantized checkpoints are available on Hugging Face for direct deployment with TensorRT-LLM and vLLM.
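
As an illustration of the last point, a published ModelOpt-quantized checkpoint can be loaded directly by an inference engine. The sketch below uses vLLM; the checkpoint name is an example placeholder, and vLLM deployment is noted as experimental (see Limitations below):

    from vllm import LLM, SamplingParams

    # Example checkpoint name; substitute a ModelOpt-quantized model actually
    # published on Hugging Face.
    llm = LLM(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        quantization="modelopt",  # use the FP8 scheme baked into the checkpoint
    )

    outputs = llm.generate(
        ["Explain what post-training quantization does."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)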

Maintenance & Community

  • Open-sourced in January 2025.
  • Active development with frequent updates and new model support (e.g., Llama 3.1, DeepSeek-R1).
  • Contributions are welcomed.
  • Roadmap available.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is primarily focused on NVIDIA GPU acceleration, and full functionality, especially for deployment examples, is best achieved using the provided Docker images. Some features, like vLLM deployment, are noted as experimental.

Health Check

  • Last Commit: 20 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 58
  • Issues (30d): 23
  • Star History: 246 stars in the last 30 days

Explore Similar Projects


  • smoothquant by mit-han-lab: Post-training quantization research paper for large language models. 0.3% · 2k stars · Created 2 years ago · Updated 1 year ago

  • neural-compressor by intel: Python library for model compression (quantization, pruning, distillation, NAS). 0.2% · 2k stars · Created 5 years ago · Updated 17 hours ago