text-embeddings-inference by huggingface

Inference solution for text embeddings models

Created 1 year ago
4,019 stars

Top 12.3% on SourcePulse

View on GitHub
Project Summary

Text Embeddings Inference (TEI) is a high-performance inference solution for deploying and serving open-source text embeddings and sequence classification models. It targets developers and researchers needing efficient, scalable inference for applications like RAG, semantic search, and sentiment analysis, offering significant speedups and reduced latency.

How It Works

TEI is built on optimized Rust code and uses Flash Attention and cuBLASLt to accelerate inference. It supports dynamic batching, Safetensors and ONNX weight loading, and Metal for local execution on Macs. The architecture targets low latency and high throughput, with OpenTelemetry distributed tracing and Prometheus metrics for monitoring.
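
Dynamic batching is visible to clients mainly through the HTTP /embed route, which accepts either a single string or a list of strings; concurrent requests are also folded into larger GPU batches on the server. Below is a minimal client sketch, assuming a TEI server is already listening on localhost:8080 with an embedding model loaded (the route and payload follow TEI's public API docs; verify the response shape against your server's /docs page):

    # Hypothetical client sketch against a local TEI server.
    import requests

    TEI_URL = "http://localhost:8080"

    # One request can carry a batch of inputs; TEI also dynamically batches
    # concurrent single requests on the server, so either pattern keeps the GPU busy.
    payload = {"inputs": ["What is Deep Learning?", "What is retrieval-augmented generation?"]}

    resp = requests.post(f"{TEI_URL}/embed", json=payload, timeout=30)
    resp.raise_for_status()

    embeddings = resp.json()  # expected: one embedding vector per input string
    print(f"{len(embeddings)} vectors of dimension {len(embeddings[0])}")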

Quick Start & Requirements

  • Docker (a smoke-test sketch follows this list): docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id BAAI/bge-large-en-v1.5
  • Prerequisites: NVIDIA drivers compatible with CUDA 12.2+ for GPU usage. Local install requires Rust, OpenSSL, and GCC.
  • Links: API Documentation, Supported Models
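
Once the container from the quick-start command is running, a short smoke test can confirm the server is healthy and report which model it loaded. This is a sketch, not an official client: the /health and /info routes and the model_id field come from TEI's OpenAPI documentation and may vary across versions, so check http://localhost:8080/docs on your deployment.

    # Smoke-test sketch for a TEI container started with the Docker command above.
    # Route names (/health, /info) and the model_id field are taken from TEI's
    # OpenAPI docs; verify them against your server's /docs page.
    import requests

    TEI_URL = "http://localhost:8080"

    health = requests.get(f"{TEI_URL}/health", timeout=10)
    print("healthy" if health.ok else f"unhealthy: HTTP {health.status_code}")

    info = requests.get(f"{TEI_URL}/info", timeout=10).json()
    print("serving model:", info.get("model_id"))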

Highlighted Details

  • Blazing fast inference for popular embedding models (e.g., BAAI/bge-base-en-v1.5).
  • Supports text embeddings, re-rankers, and sequence classification models (a re-rank request sketch follows this list).
  • Offers both HTTP and gRPC APIs for flexible deployment.
  • Optimized Docker images for various NVIDIA GPU architectures (Ampere, Ada Lovelace, Hopper).
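
For re-ranker models, the same HTTP server exposes a /rerank route that scores a list of passages against a query. A hedged sketch, assuming the container is serving a cross-encoder re-ranker (for example, started with --model-id BAAI/bge-reranker-large instead of the embedding model above); the payload and response fields follow TEI's API docs and should be verified via /docs:

    # Re-rank sketch: requires TEI to be serving a re-ranker model, e.g.
    #   --model-id BAAI/bge-reranker-large
    # Payload and response fields follow TEI's API docs (verify via /docs).
    import requests

    TEI_URL = "http://localhost:8080"

    payload = {
        "query": "What is Deep Learning?",
        "texts": [
            "Deep learning is a subset of machine learning based on neural networks.",
            "Paris is the capital of France.",
        ],
    }

    resp = requests.post(f"{TEI_URL}/rerank", json=payload, timeout=30)
    resp.raise_for_status()

    # Expected: a list of {"index": ..., "score": ...}; higher score = more relevant.
    for item in sorted(resp.json(), key=lambda r: r["score"], reverse=True):
        print(item["index"], round(item["score"], 4))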

Maintenance & Community

  • Actively maintained by Hugging Face.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 License.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Flash Attention is off by default for Turing GPUs due to potential precision issues.
  • CUDA compute capabilities below 7.5 are not supported.
  • Metal/MPS support is not available via Docker on M1/M2 Macs; local CPU inference will be slow.
Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 14
  • Issues (30d): 12

Star History

  • 121 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0%
827
Pretraining code for depth-recurrent language model research
Created 7 months ago
Updated 1 week ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago