optimum-nvidia by huggingface

SDK for optimized inference on NVIDIA hardware

Created 2 years ago
1,026 stars

Top 36.5% on SourcePulse

Project Summary

Optimum-NVIDIA provides optimized inference for Hugging Face Transformers models on NVIDIA GPUs, targeting researchers and developers who need maximum throughput. By building on TensorRT-LLM, it claims speedups of up to 28x for models such as LLaMA 2.

How It Works

This library integrates Hugging Face's Transformers ecosystem with NVIDIA's TensorRT-LLM, a high-performance inference runtime. It achieves speedups by compiling models into optimized TensorRT engines, utilizing techniques like FP8 precision (on Hopper and Ada-Lovelace architectures) and efficient kernel implementations for LLM operations. This approach minimizes overhead and maximizes hardware utilization for faster inference.
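In practice, the advertised change is swapping the transformers import for the optimum.nvidia drop-in. The snippet below is a minimal sketch following the project's documented example; the use_fp8 flag and exact generate() behavior may vary across versions.

    # Swap `from transformers import AutoModelForCausalLM` for the drop-in below.
    from optimum.nvidia import AutoModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        use_fp8=True,  # FP8 engines require Hopper or Ada-Lovelace GPUs
    )

    inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])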

Quick Start & Requirements

  • Installation: pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia (Ubuntu validated). Docker images are available.
  • Prerequisites: Python 3.10, CUDA 12.6, TensorRT-LLM 0.15.0. Requires NVIDIA GPUs (Ampere, Hopper, Ada-Lovelace architectures). FP8 support is limited to Hopper and Ada-Lovelace.
  • Setup: Installation via pip requires system packages (python3.10, python3-pip, openmpi-bin, libopenmpi-dev). Building from source involves cloning and compiling TensorRT-LLM. A post-install sanity check is sketched after this list.
  • Docs: https://huggingface.co/docs/optimum/index
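Because the stack pins specific CUDA and TensorRT-LLM versions, a quick environment check (a hypothetical snippet, not part of the project) catches most setup problems before any engine is built:

    # Hypothetical post-install sanity check; not shipped with optimum-nvidia.
    import torch

    assert torch.cuda.is_available(), "optimum-nvidia requires an NVIDIA GPU"
    print(torch.cuda.get_device_name(0))  # expect an Ampere, Ada-Lovelace, or Hopper GPU
    print(torch.version.cuda)             # the stack is validated against CUDA 12.6

    import optimum.nvidia  # fails fast if the TensorRT-LLM or MPI dependencies are broken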

Highlighted Details

  • Reports up to 1,200 tokens/second for LLaMA 2.
  • Minimal code changes required to leverage optimizations (single line modification).
  • Supports the Hugging Face pipeline and generate() APIs (see the pipeline sketch after this list).
  • Tested on RTX 4090, L40S, and H100 GPUs.
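The pipeline path is meant to mirror transformers.pipeline; this sketch assumes the pipelines entry point shown in the project README, with an illustrative model ID:

    # pipeline API sketch, following the import path in the project README.
    from optimum.nvidia.pipelines import pipeline

    pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)
    print(pipe("Describe a real-world application of machine learning."))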

Maintenance & Community

The project is developed by Hugging Face and NVIDIA, though recent activity is limited (see Health Check below). Contributing guidelines are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Compatible with commercial use and closed-source linking.

Limitations & Caveats

Currently validated on Ubuntu only, with Windows support planned. Text generation support currently centers on LLaMAForCausalLM, with support for other architectures in progress. FP8 is available only on Hopper and Ada-Lovelace hardware; a runtime capability check is sketched below.
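Because FP8 availability depends on the GPU generation, a small guard (a hypothetical sketch, not project code) can choose the precision at runtime:

    # Hypothetical FP8 capability guard; not part of optimum-nvidia.
    import torch

    major, minor = torch.cuda.get_device_capability()
    # FP8 tensor cores arrive with Ada-Lovelace (sm_89) and Hopper (sm_90);
    # on Ampere (sm_80/sm_86), fall back to FP16/BF16 engines.
    use_fp8 = (major, minor) >= (8, 9)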

Health Check

  • Last commit: 11 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

Explore Similar Projects

parallelformers by tunib-ai
Toolkit for easy model parallelization
0% · 791 stars · Created 4 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba
LLM inference engine for diverse applications
0.7% · 995 stars · Created 2 years ago · Updated 9 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16
High-performance C++ LLM inference library
0.1% · 4k stars · Created 2 years ago · Updated 1 month ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA
Library for Transformer model acceleration on NVIDIA GPUs
0.9% · 3k stars · Created 3 years ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai
Efficient CUDA kernels for MLA decoding
0.1% · 12k stars · Created 10 months ago · Updated 3 weeks ago