infinity by michaelfeil

REST API for high-throughput, low-latency embedding and reranking

Created 2 years ago

2,613 stars

Top 17.8% on SourcePulse

Project Summary

Infinity is a high-throughput, low-latency REST API for serving text embedding, reranking, CLIP, CLAP, and ColPali models. It targets developers and researchers who need efficient inference for RAG and multimodal AI tasks, and it is compatible with Hugging Face models and the OpenAI API specification.

How It Works

Infinity builds on several fast inference backends: PyTorch, Optimum (ONNX/TensorRT), and CTranslate2. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, and Apple MPS, and applies optimizations such as FlashAttention where the hardware supports them. Incoming requests are batched dynamically, and tokenization runs in dedicated worker threads so that it does not block model execution. A single engine can serve multiple models at once, so embedding, reranking, and multimodal models can be combined in one pipeline.
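
The dynamic batching idea can be illustrated with a small, self-contained sketch. This is not Infinity's implementation: the `DynamicBatcher` class, its parameters, and the `fake_embed` stand-in model are all hypothetical names invented for this example. It only shows the general technique of collecting queued requests in a worker thread until a batch is full or a short wait expires, then making one batched call.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Toy dynamic batcher (hypothetical, not Infinity's code)."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self._batch_fn = batch_fn            # callable: list[str] -> list[result]
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        # A dedicated worker thread drains the queue and runs batches.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, text):
        """Queue one input and return a Future that will hold its result."""
        future = Future()
        self._queue.put((text, future))
        return future

    def _run(self):
        while True:
            # Block until at least one request is waiting.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the wait expires.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            texts = [text for text, _ in batch]
            try:
                results = self._batch_fn(texts)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Propagate a failed batch to every caller in it.
                for _, future in batch:
                    future.set_exception(exc)


def fake_embed(texts):
    """Stand-in for a model call: one 1-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in texts]
```

A caller would create `DynamicBatcher(fake_embed)`, call `submit(text)` for each input, and read `future.result()`. Larger `max_wait_s` values trade a little latency for fuller batches.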

Quick Start & Requirements

Install via pip: pip install "infinity-emb[all]" (the quotes keep shells such as zsh from interpreting the square brackets)
Docker is recommended for deployment.
Supports NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, and Apple MPS accelerators.
Requires Python 3.11+ for development.
Official documentation: https://michaelfeil.github.io/infinity
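
Once a server is running, a client can talk to it with plain HTTP. The sketch below uses only the Python standard library and assumes an OpenAI-style embeddings route. The base URL, port, route, and model id are assumptions for illustration and are not stated in this summary; check the official documentation for the values that apply to your deployment.

```python
import json
import urllib.request

# Assumption: a locally running server. Adjust host and port as needed.
BASE_URL = "http://localhost:7997"


def build_embedding_request(texts, model, base_url=BASE_URL):
    """Build a POST request with an OpenAI-style embeddings payload."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_embeddings(texts, model, base_url=BASE_URL, timeout=30.0):
    """Send the request and return the decoded JSON response body."""
    request = build_embedding_request(texts, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder model id; use whichever model the server was started with.
    body = post_embeddings(["hello world"], model="BAAI/bge-small-en-v1.5")
    print(len(body["data"]), "embedding(s) returned")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.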

Highlighted Details

Deploys embedding, reranking, CLIP, and sentence-transformers models from Hugging Face.
Supports text embeddings, reranking, multimodal (CLIP, CLAP), and text classification.
Experimental support for INT8 (CPU/CUDA) and FP8 (H100/MI300).
REST API compatible with the OpenAI API specification.
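
Because the API follows the OpenAI specification, an embeddings response is a JSON object whose `data` list holds one item per input, each with an `index` and an `embedding`. As a rough sketch of how such a response might feed a RAG retrieval step, the helpers below extract the vectors and rank documents by cosine similarity. The function names are hypothetical, and the sample response in the usage note is made up for illustration.

```python
import math


def vectors_from_response(body):
    """Return embeddings from an OpenAI-style response, ordered by input index."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, given a made-up response `{"data": [{"index": 1, "embedding": [1.0, 0.0]}, {"index": 0, "embedding": [0.0, 1.0]}]}`, `vectors_from_response` restores input order before the vectors are passed to `rank_documents`.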

Maintenance & Community

Developed by Michael Feil.
Active development with recent updates in late 2024.
Community links are not explicitly provided in the README.

Licensing & Compatibility

MIT License.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

Specialized Docker images for ROCm and TensorRT/ONNX are not continuously built via CI/CD and may require pinning to exact versions.
The CTranslate2 engine supports only BERT models.
Plain vision models (e.g., nomic-ai/nomic-embed-vision-v1.5) are not supported for multimodal tasks.

Health Check

Last Commit: 3 weeks ago

Responsiveness: 1 day

Star History

39 stars in the last 30 days