optimum-nvidia by huggingface

SDK for optimized inference on NVIDIA hardware

Created 2 years ago
1,026 stars

Top 36.5% on SourcePulse

Project Summary

Optimum-NVIDIA provides optimized inference for Hugging Face Transformers models on NVIDIA GPUs, targeting researchers and developers who need maximum throughput. By building on TensorRT-LLM, it claims speedups of up to 28x for models such as LLaMA 2.

How It Works

This library integrates Hugging Face's Transformers ecosystem with NVIDIA's TensorRT-LLM, a high-performance inference runtime. It achieves speedups by compiling models into optimized TensorRT engines, utilizing techniques like FP8 precision (on Hopper and Ada-Lovelace architectures) and efficient kernel implementations for LLM operations. This approach minimizes overhead and maximizes hardware utilization for faster inference.
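In practice, the advertised change is swapping the transformers import for the optimum.nvidia drop-in. The snippet below is a minimal sketch following the project's documented example; the use_fp8 flag and exact generate() behavior may vary across versions.

    # Swap `from transformers import AutoModelForCausalLM` for the drop-in below.
    from optimum.nvidia import AutoModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        use_fp8=True,  # FP8 engines require Hopper or Ada-Lovelace GPUs
    )

    inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])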

Quick Start & Requirements

  • Installation: pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia (Ubuntu validated). Docker images are available.
  • Prerequisites: Python 3.10, CUDA 12.6, TensorRT-LLM 0.15.0. Requires NVIDIA GPUs (Ampere, Hopper, Ada-Lovelace architectures). FP8 support is limited to Hopper and Ada-Lovelace.
  • Setup: Installation via pip requires system packages (python3.10, python3-pip, openmpi-bin, libopenmpi-dev). Building from source involves cloning and compiling TensorRT-LLM. A post-install sanity check is sketched after this list.
  • Docs: https://huggingface.co/docs/optimum/index
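Because the stack pins specific CUDA and TensorRT-LLM versions, a quick environment check (a hypothetical snippet, not part of the project) catches most setup problems before any engine is built:

    # Hypothetical post-install sanity check; not shipped with optimum-nvidia.
    import torch

    assert torch.cuda.is_available(), "optimum-nvidia requires an NVIDIA GPU"
    print(torch.cuda.get_device_name(0))  # expect an Ampere, Ada-Lovelace, or Hopper GPU
    print(torch.version.cuda)             # the stack is validated against CUDA 12.6

    import optimum.nvidia  # fails fast if the TensorRT-LLM or MPI dependencies are broken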

Highlighted Details

  • Reports up to 1,200 tokens/second for LLaMA 2.
  • Minimal code changes required to leverage optimizations (single line modification).
  • Supports the Hugging Face pipeline and generate() APIs (see the pipeline sketch after this list).
  • Tested on RTX 4090, L40S, and H100 GPUs.
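The pipeline path is meant to mirror transformers.pipeline; this sketch assumes the pipelines entry point shown in the project README, with an illustrative model ID:

    # pipeline API sketch, following the import path in the project README.
    from optimum.nvidia.pipelines import pipeline

    pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)
    print(pipe("Describe a real-world application of machine learning."))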

Maintenance & Community

The project is developed by Hugging Face and NVIDIA, though recent activity is limited (see Health Check below). Contributing guidelines are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial use and closed-source linking.

Limitations & Caveats

Currently validated on Ubuntu only, with Windows support planned. Text generation support currently centers on LLaMAForCausalLM, with support for other architectures in progress. FP8 is available only on Hopper and Ada-Lovelace hardware; a runtime capability check is sketched below.
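Because FP8 availability depends on the GPU generation, a small guard (a hypothetical sketch, not project code) can choose the precision at runtime:

    # Hypothetical FP8 capability guard; not part of optimum-nvidia.
    import torch

    major, minor = torch.cuda.get_device_capability()
    # FP8 tensor cores arrive with Ada-Lovelace (sm_89) and Hopper (sm_90);
    # on Ampere (sm_80/sm_86), fall back to FP16/BF16 engines.
    use_fp8 = (major, minor) >= (8, 9)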

Health Check

  • Last commit: 11 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

Explore Similar Projects

parallelformers by tunib-ai
Toolkit for easy model parallelization
0% · 791 stars · Created 4 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba
LLM inference engine for diverse applications
0.7% · 995 stars · Created 2 years ago · Updated 9 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16
High-performance C++ LLM inference library
0.1% · 4k stars · Created 2 years ago · Updated 1 month ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA
Library for Transformer model acceleration on NVIDIA GPUs
0.9% · 3k stars · Created 3 years ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai
Efficient CUDA kernels for MLA decoding
0.1% · 12k stars · Created 10 months ago · Updated 3 weeks ago