SDK for optimized inference on NVIDIA hardware
Optimum-NVIDIA provides optimized inference for Hugging Face Transformers models on NVIDIA GPUs, targeting researchers and developers seeking maximum throughput. By leveraging TensorRT-LLM, it claims speedups of up to 28x for models such as LLaMA 2.
How It Works
This library integrates Hugging Face's Transformers ecosystem with NVIDIA's TensorRT-LLM, a high-performance inference runtime. It achieves speedups by compiling models into optimized TensorRT engines, utilizing techniques like FP8 precision (on Hopper and Ada-Lovelace architectures) and efficient kernel implementations for LLM operations. This approach minimizes overhead and maximizes hardware utilization for faster inference.
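FP8 quantization works by rescaling a tensor so its dynamic range fits the 8-bit float format. The pure-Python sketch below illustrates only the per-tensor scaling step against the E4M3 maximum of 448; the function names are hypothetical, and a real TensorRT-LLM kernel would additionally round each scaled value to the nearest representable E4M3 number, which is omitted here.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_per_tensor(values):
    """Scale a tensor so its largest magnitude maps onto the FP8 range."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # A real kernel would also round each scaled value to the nearest
    # representable E4M3 number; this sketch skips that step.
    return [v / scale for v in values], scale

def dequantize(quantized, scale):
    """Map scaled values back to the original dynamic range."""
    return [q * scale for q in quantized]

activations = [0.75, -3.2, 1.5, 0.01]
quantized, scale = quantize_per_tensor(activations)
restored = dequantize(quantized, scale)
print(max(abs(q) for q in quantized))  # 448.0: the FP8 range is fully used
```

The per-tensor scale factor is what lets an 8-bit format with a narrow dynamic range represent activations of very different magnitudes without overflow.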
Quick Start & Requirements
pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia
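After installation, the upstream README shows loading a model through optimum-nvidia's drop-in AutoModelForCausalLM. The sketch below follows that pattern; the checkpoint name is illustrative, an NVIDIA GPU is required, and use_fp8 needs Hopper or Ada-Lovelace hardware.

```python
# Drop-in replacement for transformers.AutoModelForCausalLM; the model
# is compiled into an optimized TensorRT engine when loaded.
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("Hello, TensorRT-LLM!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```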
The pip install is validated on Ubuntu only; Docker images are also available. System prerequisites include python3.10, python3-pip, openmpi-bin, and libopenmpi-dev. Building from source involves cloning and compiling TensorRT-LLM.
Highlighted Details
Drop-in replacements for Hugging Face's pipeline and generate() APIs.
Maintenance & Community
The project is actively developed by Hugging Face and NVIDIA. Contributing guidelines are available.
Licensing & Compatibility
Limitations & Caveats
Currently validated on Ubuntu only, with Windows support planned. Text generation support is primarily for LLaMAForCausalLM, with expansion to other architectures ongoing. FP8 support is hardware-dependent.