text-embeddings-inference by huggingface

Inference solution for text embeddings models

Created 1 year ago
4,019 stars

Top 12.3% on SourcePulse

View on GitHub
Project Summary

Text Embeddings Inference (TEI) is a high-performance inference solution for deploying and serving open-source text embeddings and sequence classification models. It targets developers and researchers needing efficient, scalable inference for applications like RAG, semantic search, and sentiment analysis, offering significant speedups and reduced latency.

How It Works

TEI is built on optimized Rust code and uses Flash Attention and cuBLASLt to accelerate inference. It supports dynamic batching, Safetensors and ONNX weight loading, and Metal for local execution on Macs. The architecture targets low latency and high throughput, with OpenTelemetry distributed tracing and Prometheus metrics for monitoring.
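
Dynamic batching is visible to clients mainly through the HTTP /embed route, which accepts either a single string or a list of strings; concurrent requests are also folded into larger GPU batches on the server. Below is a minimal client sketch, assuming a TEI server is already listening on localhost:8080 with an embedding model loaded (the route and payload follow TEI's public API docs; verify the response shape against your server's /docs page):

    # Hypothetical client sketch against a local TEI server.
    import requests

    TEI_URL = "http://localhost:8080"

    # One request can carry a batch of inputs; TEI also dynamically batches
    # concurrent single requests on the server, so either pattern keeps the GPU busy.
    payload = {"inputs": ["What is Deep Learning?", "What is retrieval-augmented generation?"]}

    resp = requests.post(f"{TEI_URL}/embed", json=payload, timeout=30)
    resp.raise_for_status()

    embeddings = resp.json()  # expected: one embedding vector per input string
    print(f"{len(embeddings)} vectors of dimension {len(embeddings[0])}")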

Quick Start & Requirements

  • Docker (a smoke-test sketch follows this list): docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id BAAI/bge-large-en-v1.5
  • Prerequisites: NVIDIA drivers compatible with CUDA 12.2+ for GPU usage. Local install requires Rust, OpenSSL, and GCC.
  • Links: API Documentation, Supported Models
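
Once the container from the quick-start command is running, a short smoke test can confirm the server is healthy and report which model it loaded. This is a sketch, not an official client: the /health and /info routes and the model_id field come from TEI's OpenAPI documentation and may vary across versions, so check http://localhost:8080/docs on your deployment.

    # Smoke-test sketch for a TEI container started with the Docker command above.
    # Route names (/health, /info) and the model_id field are taken from TEI's
    # OpenAPI docs; verify them against your server's /docs page.
    import requests

    TEI_URL = "http://localhost:8080"

    health = requests.get(f"{TEI_URL}/health", timeout=10)
    print("healthy" if health.ok else f"unhealthy: HTTP {health.status_code}")

    info = requests.get(f"{TEI_URL}/info", timeout=10).json()
    print("serving model:", info.get("model_id"))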

Highlighted Details

  • Blazing fast inference for popular embedding models (e.g., BAAI/bge-base-en-v1.5).
  • Supports text embeddings, re-rankers, and sequence classification models (a re-rank request sketch follows this list).
  • Offers both HTTP and gRPC APIs for flexible deployment.
  • Optimized Docker images for various NVIDIA GPU architectures (Ampere, Ada Lovelace, Hopper).
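
For re-ranker models, the same HTTP server exposes a /rerank route that scores a list of passages against a query. A hedged sketch, assuming the container is serving a cross-encoder re-ranker (for example, started with --model-id BAAI/bge-reranker-large instead of the embedding model above); the payload and response fields follow TEI's API docs and should be verified via /docs:

    # Re-rank sketch: requires TEI to be serving a re-ranker model, e.g.
    #   --model-id BAAI/bge-reranker-large
    # Payload and response fields follow TEI's API docs (verify via /docs).
    import requests

    TEI_URL = "http://localhost:8080"

    payload = {
        "query": "What is Deep Learning?",
        "texts": [
            "Deep learning is a subset of machine learning based on neural networks.",
            "Paris is the capital of France.",
        ],
    }

    resp = requests.post(f"{TEI_URL}/rerank", json=payload, timeout=30)
    resp.raise_for_status()

    # Expected: a list of {"index": ..., "score": ...}; higher score = more relevant.
    for item in sorted(resp.json(), key=lambda r: r["score"], reverse=True):
        print(item["index"], round(item["score"], 4))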

Maintenance & Community

  • Actively maintained by Hugging Face.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Apache 2.0 License.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Flash Attention is off by default for Turing GPUs due to potential precision issues.
  • CUDA compute capabilities below 7.5 are not supported.
  • Metal/MPS support is not available via Docker on M1/M2 Macs; local CPU inference will be slow.
Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 14
  • Issues (30d): 12

Star History

  • 121 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0%
827
Pretraining code for depth-recurrent language model research
Created 7 months ago
Updated 1 week ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago