optimum-nvidia by huggingface

SDK for optimized inference on NVIDIA hardware

Created 1 year ago
1,002 stars

Top 37.3% on SourcePulse

Project Summary

Optimum-NVIDIA provides optimized inference for Hugging Face Transformers models on NVIDIA GPUs, targeting researchers and developers who need maximum throughput. By building on TensorRT-LLM, it claims speedups of up to 28x for models such as LLaMA 2.

How It Works

This library bridges the Hugging Face Transformers ecosystem and NVIDIA's TensorRT-LLM, a high-performance inference runtime. It achieves its speedups by compiling models into optimized TensorRT engines, applying techniques such as FP8 precision (on Hopper and Ada Lovelace architectures) and efficient kernel implementations for common LLM operations. This minimizes framework overhead and maximizes hardware utilization, as the sketch below shows.
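In practice, the integration surfaces as a drop-in replacement for the Transformers model classes. A minimal sketch of that pattern, following the project README (the checkpoint name is illustrative, and FP8 requires supported hardware):

```python
# Drop-in swap: import the model class from optimum.nvidia instead of transformers.
# from transformers import AutoModelForCausalLM
from optimum.nvidia import AutoModelForCausalLM

# use_fp8=True requests FP8 execution; it only takes effect on
# Hopper or Ada Lovelace GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # any supported causal-LM checkpoint
    use_fp8=True,
)
```

The first load compiles the checkpoint into a TensorRT engine, so expect a one-time build cost before inference runs at full speed.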

Quick Start & Requirements

  • Installation: pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia (validated on Ubuntu). Docker images are also available. A first-inference sketch follows this list.
  • Prerequisites: Python 3.10, CUDA 12.6, TensorRT-LLM 0.15.0. Requires an NVIDIA GPU (Ampere, Hopper, or Ada Lovelace architecture). FP8 support is limited to Hopper and Ada Lovelace.
  • Setup: Installation via pip requires system packages (python3.10, python3-pip, openmpi-bin, libopenmpi-dev). Building from source involves cloning and compiling TensorRT-LLM.
  • Docs: https://huggingface.co/docs/optimum/index
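After installation, a quick smoke test can use the library's pipeline wrapper. A hedged sketch following the README's example (the optimum.nvidia.pipelines import and use_fp8 flag come from the README; the output format is assumed to mirror the Transformers pipeline):

```python
from optimum.nvidia.pipelines import pipeline

# First use builds a TensorRT engine for the checkpoint, which can take a while.
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)

result = pipe("Summarize TensorRT-LLM in one sentence.")
print(result[0]["generated_text"])
```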

Highlighted Details

  • Reports up to 1,200 tokens/second for LLaMA 2.
  • Minimal code changes required: a single-line import swap enables the optimizations.
  • Supports the Hugging Face pipeline and generate() APIs (see the generate() sketch after this list).
  • Tested on RTX 4090, L40S, and H100 GPUs.
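Because the wrapper keeps the Transformers generate() contract largely intact, existing generation code needs only the import change. A sketch following the README's example (the two-value return from generate() matches the README at the pinned version and may differ in other releases; the checkpoint and sampling parameters are illustrative):

```python
from transformers import AutoTokenizer

from optimum.nvidia import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf", padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf", use_fp8=True
)

inputs = tokenizer(
    ["How does FP8 inference improve throughput?"], return_tensors="pt"
).to("cuda")

# The README's example unpacks both token ids and generated lengths.
generated_ids, generated_lengths = model.generate(
    **inputs,
    top_k=40,
    top_p=0.7,
    repetition_penalty=10,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```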

Maintenance & Community

The project is developed by Hugging Face in collaboration with NVIDIA, though the Health Check below shows activity has slowed recently. Contributing guidelines are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Suitable for commercial use and closed-source linking.

Limitations & Caveats

Currently validated only on Ubuntu, with Windows support planned. Text generation support primarily covers LLaMAForCausalLM, with expansion to other architectures ongoing. FP8 support is hardware-dependent (Hopper and Ada Lovelace only).

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

0.4%
3k
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago
Updated 19 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 12 hours ago