text-generation-inference by huggingface

Rust/Python/gRPC server for fast LLM text generation

created 2 years ago
10,376 stars

Top 4.9% on sourcepulse

View on GitHub
Project Summary

Text Generation Inference (TGI) is a high-performance Rust, Python, and gRPC server for deploying and serving large language models (LLMs). It's designed for production use, powering services like Hugging Chat and the Inference API, and targets developers and researchers needing efficient LLM inference.

How It Works

TGI leverages a Rust backend for performance-critical operations and Python for model integration. It supports advanced features like Tensor Parallelism for multi-GPU inference, continuous batching for increased throughput, and optimized transformers code using Flash Attention and Paged Attention. This architecture allows for low-latency, high-throughput text generation across a wide range of popular LLMs.

Quick Start & Requirements

  • Install/Run: Via Docker (a sample request against the running server is sketched after this list):
    docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id HuggingFaceH4/zephyr-7b-beta
    
  • Prerequisites: NVIDIA GPUs with CUDA 12.2+ recommended. AMD GPUs (MI210, MI250) supported via ROCm. CPU is not the intended platform. NVIDIA Container Toolkit required for NVIDIA GPUs.
  • Setup: Docker is the easiest method. Local install requires Rust and Python 3.9+.
  • Docs: Quick Tour, API Docs
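
Once the container is up, the server can be exercised over plain HTTP. Below is a minimal sketch against TGI's documented /generate endpoint, assuming the Docker command above so the server listens on localhost:8080; the prompt and sampling parameters are illustrative.

    import requests

    # Assumes the TGI container from the Quick Start is listening on localhost:8080.
    TGI_URL = "http://localhost:8080/generate"

    payload = {
        "inputs": "What is tensor parallelism?",
        "parameters": {
            "max_new_tokens": 64,   # cap on generated tokens
            "temperature": 0.7,     # sampling temperature (logits warping)
            "top_p": 0.95,          # nucleus sampling
        },
    }

    resp = requests.post(TGI_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["generated_text"])

Concurrent requests like this are merged by the server's continuous batching, so throughput scales without any client-side batching.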

Highlighted Details

  • Supports continuous batching and token streaming (SSE).
  • Offers OpenAI Chat Completion API compatibility (a client sketch follows this list).
  • Implements quantization (bitsandbytes, GPTQ, AWQ, Marlin, fp8) and logits warping.
  • Supports NVIDIA, AMD, Inferentia, Intel GPU, and Google TPU hardware.
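
Because the server exposes an OpenAI-compatible Chat Completions route, the standard openai Python client can be pointed at it. The sketch below assumes the Quick Start container on localhost:8080; the model name and api_key strings are placeholders (the client library requires some value; whether the server checks it depends on how it was launched), and streamed tokens arrive over SSE.

    from openai import OpenAI

    # Assumes the TGI container from the Quick Start is serving on localhost:8080.
    # The api_key value is only a placeholder that the client library insists on.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    # Stream tokens as they are generated (delivered as server-sent events).
    stream = client.chat.completions.create(
        model="tgi",  # placeholder id; the server serves the single loaded model
        messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
        max_tokens=64,
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()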

Maintenance & Community

  • Actively developed and used by Hugging Face.
  • Community support via Discord/Slack is not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 license. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • CPU inference is not optimized and may perform poorly.
  • Local installation requires careful setup of Rust, Python, and potentially Protobuf and OpenSSL/gcc on Linux.
  • Nix installation is limited to x86_64 Linux with CUDA GPUs.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 9
  • Issues (30d): 13

Star History

  • 336 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

  • gpustack by gpustack: GPU cluster manager for AI model deployment (Top 1.6%, 3k stars, created 1 year ago, updated 2 days ago)