text-embeddings-inference by huggingface

Inference solution for text embeddings models

created 1 year ago
3,852 stars

Top 12.9% on sourcepulse

Project Summary

Text Embeddings Inference (TEI) is a high-performance inference solution for deploying and serving open-source text embeddings and sequence classification models. It targets developers and researchers who need efficient, scalable inference for applications such as RAG, semantic search, and sentiment analysis, and is built for high-throughput, low-latency serving.

How It Works

TEI pairs an optimized Rust server with Flash Attention and cuBLASLt kernels for accelerated inference. It supports token-based dynamic batching, Safetensors and ONNX weight loading, and Metal for local execution on Macs. The architecture targets low latency and high throughput, and ships with OpenTelemetry distributed tracing and Prometheus metrics for monitoring.
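
For a concrete sense of the request/response shape, here is a minimal sketch of calling the HTTP embedding API from Python. It assumes a TEI instance is already serving a model on localhost:8080 (as in the Docker command in the next section) and uses the /embed route described in the project's API documentation; "inputs" may be a single string or a list of strings, and concurrent requests are additionally merged by the server-side dynamic batcher.

  import requests

  # Assumes a TEI server is already listening on localhost:8080.
  # The /embed route accepts a single string or a list of strings under "inputs".
  resp = requests.post(
      "http://localhost:8080/embed",
      json={"inputs": ["What is Deep Learning?", "What is semantic search?"]},
      timeout=30,
  )
  resp.raise_for_status()

  # The response body is a JSON list with one embedding vector per input, in request order.
  embeddings = resp.json()
  print(len(embeddings), len(embeddings[0]))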

Quick Start & Requirements

  • Docker: docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id BAAI/bge-large-en-v1.5 (a smoke-test sketch follows this list)
  • Prerequisites: NVIDIA drivers compatible with CUDA 12.2+ for GPU usage. Local install requires Rust, OpenSSL, and GCC.
  • Links: API Documentation, Supported Models
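
Once the container from the Docker command above is running, a quick smoke test helps confirm the server is ready before sending real traffic. The sketch below is an assumption-laden example: it uses the /health and /info routes listed in the project's API documentation and the default host port (8080) from the command above; verify both against the API docs for your TEI version.

  import requests

  BASE = "http://localhost:8080"  # host port mapped in the docker run command above

  # /health should return a 2xx status once the model is loaded and the server is ready.
  requests.get(f"{BASE}/health", timeout=10).raise_for_status()

  # /info reports the loaded model id and server configuration.
  info = requests.get(f"{BASE}/info", timeout=10).json()
  print(info.get("model_id"))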

Highlighted Details

  • Blazing fast inference for popular embedding models (e.g., BAAI/bge-base-en-v1.5).
  • Supports text embeddings, re-rankers, and sequence classification models (see the sketch after this list).
  • Offers both HTTP and gRPC APIs for flexible deployment.
  • Optimized Docker images for various NVIDIA GPU architectures (Ampere, Ada Lovelace, Hopper).
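
As referenced in the list above, re-ranking and sequence classification are exposed through their own HTTP routes. The sketch below is hedged: the route names and payload shapes follow the examples in the upstream README (/rerank and /predict), and it assumes the server is running a matching model type, e.g. BAAI/bge-reranker-large for re-ranking or SamLowe/roberta-base-go_emotions for classification (each route needs its own deployment).

  import requests

  BASE = "http://localhost:8080"

  # Re-ranking: score candidate texts against a query.
  # Requires a re-ranker model (e.g. BAAI/bge-reranker-large) to be served.
  ranked = requests.post(
      f"{BASE}/rerank",
      json={
          "query": "What is Deep Learning?",
          "texts": [
              "Deep Learning is a sub-field of machine learning.",
              "Paris is the capital of France.",
          ],
      },
      timeout=30,
  ).json()
  print(ranked)  # expected: a list of {"index": ..., "score": ...} entries

  # Sequence classification: requires a classifier (e.g. SamLowe/roberta-base-go_emotions).
  labels = requests.post(
      f"{BASE}/predict",
      json={"inputs": "I like you. I love you"},
      timeout=30,
  ).json()
  print(labels)  # expected: a list of {"label": ..., "score": ...} entries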

Maintenance & Community

  • Actively maintained by Hugging Face.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 License.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Flash Attention is off by default for Turing GPUs due to potential precision issues.
  • CUDA compute capabilities below 7.5 are not supported.
  • Metal/MPS is not available through Docker on M1/M2 Macs, so containerized deployments there fall back to CPU and will be slow.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 9
  • Issues (30d): 10

Star History

  • 375 stars in the last 90 days
