text-generation-inference by huggingface

Rust/Python/gRPC server for fast LLM text generation

created 2 years ago
10,376 stars

Top 4.9% on sourcepulse

View on GitHub
Project Summary

Text Generation Inference (TGI) is a high-performance Rust, Python, and gRPC server for deploying and serving large language models (LLMs). It's designed for production use, powering services like Hugging Chat and the Inference API, and targets developers and researchers needing efficient LLM inference.

How It Works

TGI leverages a Rust backend for performance-critical operations and Python for model integration. It supports advanced features like Tensor Parallelism for multi-GPU inference, continuous batching for increased throughput, and optimized transformers code using Flash Attention and Paged Attention. This architecture allows for low-latency, high-throughput text generation across a wide range of popular LLMs.

Quick Start & Requirements

  • Install/Run: Via Docker (a sample request against the running server is sketched after this list):
    docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id HuggingFaceH4/zephyr-7b-beta
    
  • Prerequisites: NVIDIA GPUs with CUDA 12.2+ recommended. AMD GPUs (MI210, MI250) supported via ROCm. CPU is not the intended platform. NVIDIA Container Toolkit required for NVIDIA GPUs.
  • Setup: Docker is the easiest method. Local install requires Rust and Python 3.9+.
  • Docs: Quick Tour, API Docs
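
Once the container is up, the server can be exercised over plain HTTP. Below is a minimal sketch against TGI's documented /generate endpoint, assuming the Docker command above so the server listens on localhost:8080; the prompt and sampling parameters are illustrative.

    import requests

    # Assumes the TGI container from the Quick Start is listening on localhost:8080.
    TGI_URL = "http://localhost:8080/generate"

    payload = {
        "inputs": "What is tensor parallelism?",
        "parameters": {
            "max_new_tokens": 64,   # cap on generated tokens
            "temperature": 0.7,     # sampling temperature (logits warping)
            "top_p": 0.95,          # nucleus sampling
        },
    }

    resp = requests.post(TGI_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["generated_text"])

Concurrent requests like this are merged by the server's continuous batching, so throughput scales without any client-side batching.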

Highlighted Details

  • Supports continuous batching and token streaming (SSE).
  • Offers OpenAI Chat Completion API compatibility (a client sketch follows this list).
  • Implements quantization (bitsandbytes, GPTQ, AWQ, Marlin, fp8) and logits warping.
  • Supports NVIDIA, AMD, Inferentia, Intel GPU, and Google TPU hardware.
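
Because the server exposes an OpenAI-compatible Chat Completions route, the standard openai Python client can be pointed at it. The sketch below assumes the Quick Start container on localhost:8080; the model name and api_key strings are placeholders (the client library requires some value; whether the server checks it depends on how it was launched), and streamed tokens arrive over SSE.

    from openai import OpenAI

    # Assumes the TGI container from the Quick Start is serving on localhost:8080.
    # The api_key value is only a placeholder that the client library insists on.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    # Stream tokens as they are generated (delivered as server-sent events).
    stream = client.chat.completions.create(
        model="tgi",  # placeholder id; the server serves the single loaded model
        messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
        max_tokens=64,
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()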

Maintenance & Community

  • Actively developed and used by Hugging Face.
  • Community support via Discord/Slack is not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 license. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • CPU inference is not optimized and may perform poorly.
  • Local installation requires careful setup of Rust, Python, and potentially Protobuf and OpenSSL/gcc on Linux.
  • Nix installation is limited to x86_64 Linux with CUDA GPUs.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 9
  • Issues (30d): 13

Star History

  • 336 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

  • gpustack by gpustack: GPU cluster manager for AI model deployment (Top 1.6%, 3k stars, created 1 year ago, updated 2 days ago)