text-generation-inference by huggingface

Rust/Python/gRPC server for fast LLM text generation

Created 2 years ago
10,515 stars

Top 4.8% on SourcePulse

Project Summary

Text Generation Inference (TGI) is a high-performance Rust, Python, and gRPC server for deploying and serving large language models (LLMs). It's designed for production use, powering services like Hugging Chat and the Inference API, and targets developers and researchers needing efficient LLM inference.

How It Works

TGI leverages a Rust backend for performance-critical operations and Python for model integration. It supports advanced features like Tensor Parallelism for multi-GPU inference, continuous batching for increased throughput, and optimized transformers code using Flash Attention and Paged Attention. This architecture allows for low-latency, high-throughput text generation across a wide range of popular LLMs.
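
As a toy illustration (not TGI's actual Rust scheduler), continuous batching can be pictured as refilling free slots in the in-flight batch at every decoding step, instead of waiting for the whole batch to drain before admitting new requests:

    from collections import deque

    def generate_step(seq):
        """Stand-in for one forward pass emitting one token for a sequence."""
        seq["tokens"] += 1
        return seq["tokens"] >= seq["max_tokens"]  # True when finished

    queue = deque({"id": i, "tokens": 0, "max_tokens": 3 + i} for i in range(8))
    batch, max_batch = [], 4

    while queue or batch:
        # Continuous batching: refill free batch slots from the queue at
        # every step, rather than waiting for the batch to finish (static batching).
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
        # One decode step over the in-flight batch; evict finished sequences.
        batch = [seq for seq in batch if not generate_step(seq)]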

Quick Start & Requirements

  • Install/Run: Via Docker (see the query example after this list):
    docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:3.2.3 --model-id HuggingFaceH4/zephyr-7b-beta
    
  • Prerequisites: NVIDIA GPUs with CUDA 12.2+ recommended. AMD GPUs (MI210, MI250) supported via ROCm. CPU is not the intended platform. NVIDIA Container Toolkit required for NVIDIA GPUs.
  • Setup: Docker is the easiest method. Local install requires Rust and Python 3.9+.
  • Docs: Quick Tour, API Docs
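
Once the container is up, the server can be queried over HTTP via TGI's /generate endpoint. A minimal sketch, assuming the container started above is reachable on localhost:8080 and that the requests package is installed:

    import requests

    # Query the /generate REST endpoint of the running TGI server.
    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "What is deep learning?",
            "parameters": {"max_new_tokens": 50},
        },
    )
    print(resp.json()["generated_text"])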

Highlighted Details

  • Supports continuous batching and token streaming (SSE).
  • Offers OpenAI Chat Completion API compatibility (see the sketch after this list).
  • Implements quantization (bitsandbytes, GPTQ, AWQ, Marlin, fp8) and logits warping.
  • Supports NVIDIA and AMD GPUs, AWS Inferentia, Intel GPUs, and Google TPUs.

Maintenance & Community

  • Actively developed and used by Hugging Face.
  • Community support via Discord/Slack is not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 license. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • CPU inference is not optimized and may perform poorly.
  • Local installation requires careful setup of Rust, Python, and potentially Protobuf and OpenSSL/gcc on Linux.
  • Nix installation is limited to x86_64 Linux with CUDA GPUs.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 19
  • Issues (30d): 6

Star History

99 stars in the last 30 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

Explore Similar Projects

torchtitan by pytorch

Top 0.7% · 4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 19 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Lei Zhang (Director Engineering AI at AMD), and 23 more.

gpt-fast by meta-pytorch

Top 0.2% · 6k stars
PyTorch text generation for efficient transformer inference
Created 1 year ago · Updated 3 weeks ago