optimum-nvidia by huggingface

SDK for optimized inference on NVIDIA hardware

988 stars · created 1 year ago · Top 38.3% on sourcepulse

Project Summary

Optimum-NVIDIA provides optimized inference for Hugging Face Transformers models on NVIDIA GPUs, targeting researchers and developers who need maximum throughput. By building on TensorRT-LLM, it claims speedups of up to 28x for models such as LLaMA 2.

How It Works

This library bridges Hugging Face's Transformers ecosystem with NVIDIA's TensorRT-LLM, a high-performance inference runtime. It achieves its speedups by compiling models into optimized TensorRT engines, using techniques such as FP8 precision (on Hopper and Ada Lovelace architectures) and efficient kernel implementations for common LLM operations. This minimizes framework overhead and maximizes hardware utilization.
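
For illustration, a minimal sketch of what this looks like from user code. It assumes the drop-in AutoModelForCausalLM class and use_fp8 flag described in the project's README; the model name is only an example.

    # Sketch: swap the transformers import for the optimum-nvidia drop-in class.
    # from_pretrained() compiles (or loads a cached) TensorRT engine behind the scenes.
    from optimum.nvidia import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",  # example checkpoint
        use_fp8=True,  # FP8 engines require Hopper or Ada Lovelace GPUs
    )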

Quick Start & Requirements

  • Installation: pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia (validated on Ubuntu; see the usage sketch after this list). Prebuilt Docker images are also available.
  • Prerequisites: Python 3.10, CUDA 12.6, and TensorRT-LLM 0.15.0. Requires an NVIDIA GPU with an Ampere, Hopper, or Ada Lovelace architecture; FP8 is limited to Hopper and Ada Lovelace.
  • Setup: pip installation requires the system packages python3.10, python3-pip, openmpi-bin, and libopenmpi-dev. Building from source involves cloning the repository and compiling TensorRT-LLM.
  • Docs: https://huggingface.co/docs/optimum/index
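
Once installed, a minimal generation sketch, assuming the transformers-style tokenizer and generate() flow summarized below (exact signatures may vary by version):

    # Sketch: end-to-end text generation with the optimum-nvidia drop-in class.
    # Assumes a supported NVIDIA GPU and access to the example checkpoint.
    from transformers import AutoTokenizer
    from optimum.nvidia import AutoModelForCausalLM

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)  # builds/loads a TensorRT engine

    inputs = tokenizer("TensorRT-LLM speeds up inference by", return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=64)  # transformers-style generate()
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))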

Highlighted Details

  • Reports up to 1,200 tokens/second for LLaMA 2.
  • Requires only a single-line code change to enable the optimizations.
  • Supports the Hugging Face pipeline and generate() APIs (see the sketch after this list).
  • Tested on RTX 4090, L40S, and H100 GPUs.
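
A hedged sketch of the pipeline entry point; the optimum.nvidia.pipelines import path and the use_fp8 flag are assumptions based on the project's documented usage and may change between versions:

    # Sketch: TensorRT-accelerated text generation through the pipeline API.
    from optimum.nvidia.pipelines import pipeline

    pipe = pipeline(
        "text-generation",
        "meta-llama/Llama-2-7b-chat-hf",  # example checkpoint
        use_fp8=True,  # optional; FP8 requires Hopper or Ada Lovelace
    )
    print(pipe("Summarize TensorRT-LLM in one sentence."))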

Maintenance & Community

The project is actively developed by Hugging Face and NVIDIA. Contributing guidelines are available.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Suitable for commercial use and closed-source linking.

Limitations & Caveats

Currently validated on Ubuntu only, with Windows support planned. Text generation is supported primarily for LlamaForCausalLM, with expansion to other architectures ongoing. FP8 support is hardware-dependent (Hopper and Ada Lovelace only).

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 35 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin
Triton kernels for efficient LLM training
5k stars · top 0.6% · created 1 year ago · updated 2 days ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA
Optimized transformer library for inference
6k stars · top 0.2% · created 4 years ago · updated 1 year ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
Framework for LLM inference optimization experimentation
15k stars · top 0.4% · created 1 year ago · updated 3 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
LLM inference optimization SDK for NVIDIA GPUs
11k stars · top 0.6% · created 1 year ago · updated 22 hours ago