REST API for high-throughput, low-latency embedding and reranking
Infinity is a high-throughput, low-latency REST API for serving text-embedding, reranking, CLIP, CLAP, and ColPali models. It targets developers and researchers who need efficient inference for RAG and multimodal AI tasks, and it is compatible with Hugging Face models and the OpenAI API specification.
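To illustrate what OpenAI API compatibility means in practice, here is a minimal sketch that builds an OpenAI-style embeddings request and parses the matching response shape using only the Python standard library. The base URL, port, and model name are assumptions for illustration; the helper names `build_embeddings_request` and `parse_embeddings_response` are hypothetical and not part of Infinity.

```python
import json
from urllib import request


def build_embeddings_request(base_url, model, texts):
    """Build a POST request in the OpenAI embeddings format.

    base_url is an assumption: point it at wherever the server is running,
    including any URL prefix the deployment uses.
    """
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(payload):
    """Return the embedding vectors from an OpenAI-style response, in input order."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


if __name__ == "__main__":
    # Hypothetical local deployment and example model name.
    req = build_embeddings_request(
        "http://localhost:7997", "example-embedding-model", ["hello world"]
    )
    print(req.full_url, req.data.decode("utf-8"))
    # Sending the request requires a running server:
    # with request.urlopen(req) as resp:
    #     vectors = parse_embeddings_response(json.load(resp))
```

Because the request and response follow the OpenAI embeddings schema, an existing OpenAI-compatible client can usually be pointed at such a server by changing only its base URL.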
How It Works
Infinity builds on several fast inference backends, including PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, and uses optimizations such as FlashAttention. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, and Apple MPS. Dynamic batching and tokenization run in dedicated worker threads, so incoming requests are grouped and prepared without blocking the server. The engine can also serve multiple models at once, which lets you mix and match embedding, reranking, and multimodal models in a single pipeline.
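The dynamic batching idea can be sketched as follows. This is a simplified illustration, not Infinity's actual implementation: `DynamicBatcher`, `fake_embed`, and the `max_batch_size` and `max_wait_s` parameters are hypothetical names chosen for the example. Requests are queued, grouped until the batch is full or a short deadline passes, and then processed together in a worker thread.

```python
import asyncio


class DynamicBatcher:
    """Group concurrent requests into batches (illustrative sketch only)."""

    def __init__(self, batch_fn, max_batch_size=4, max_wait_s=0.01):
        self.batch_fn = batch_fn          # processes a list of items at once
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s      # how long to wait for a fuller batch
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one item and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: collect a batch, run it in a thread, fan results out."""
        loop = asyncio.get_running_loop()
        while True:
            pending = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(pending) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    pending.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in pending]
            try:
                # Run the blocking batch function off the event loop thread.
                results = await asyncio.to_thread(self.batch_fn, items)
            except Exception as exc:
                for _, future in pending:
                    if not future.cancelled():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(pending, results):
                if not future.cancelled():
                    future.set_result(result)


def fake_embed(batch):
    """Stand-in for a model forward pass: one 1-d 'embedding' per text."""
    return [[float(len(text))] for text in batch]


async def demo():
    batch_sizes = []

    def recording_embed(batch):
        batch_sizes.append(len(batch))
        return fake_embed(batch)

    batcher = DynamicBatcher(recording_embed, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    results = await asyncio.gather(*(batcher.submit(t) for t in texts))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is between latency and throughput: a longer `max_wait_s` yields fuller batches and better accelerator utilization, while a shorter one returns individual results sooner.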
Quick Start & Requirements
pip install "infinity-emb[all]"
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats