pyturboquant by jorgebmann

GPU-accelerated online vector quantization for RAG and ANN

Created 1 month ago
399 stars

Top 72.1% on SourcePulse

Project Summary

This library provides a Python implementation of Google's TurboQuant framework, enabling data-oblivious online vector quantization for embedding storage and approximate nearest neighbor (ANN) search. It offers significant memory compression for RAG pipelines and large-scale retrieval without codebook training or multiple passes over the data, making it well suited to on-premise deployments and resource-constrained environments.

How It Works

The core approach applies a random orthogonal rotation to each vector, followed by per-coordinate scalar quantization using precomputed Lloyd-Max codebooks (the MSE-Optimal Quantizer). For inner product estimation, a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied to the quantization residual, yielding an unbiased estimator. Because each vector is quantized independently, ingestion is truly online and requires no costly indexing or retraining step.
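The pipeline above can be sketched in plain NumPy. This is an illustration, not the library's code: the uniform quantizer below is a stand-in for TurboQuant's precomputed Lloyd-Max codebooks, and the projection count `m` is chosen only to make the estimate visibly converge.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: deterministic rotation and QJL
d, m = 16, 20000                 # vector dimension, number of QJL projections

# 1) Random orthogonal rotation: QR of a Gaussian matrix, sign-corrected
#    so the result is Haar-distributed.
Q_, R_ = np.linalg.qr(rng.standard_normal((d, d)))
rot = Q_ * np.sign(np.diag(R_))

# 2) Per-coordinate scalar quantization of the rotated vector
#    (uniform quantizer as a stand-in for Lloyd-Max codebooks).
def quantize(y, bits=3):
    levels = 2 ** (bits - 1)
    scale = np.abs(y).max() / levels
    return np.clip(np.round(y / scale), -levels, levels - 1) * scale

x = rng.standard_normal(d)       # vector to store
q = rng.standard_normal(d)       # query
xr, qr = rot @ x, rot @ q        # an orthogonal rotation preserves <x, q>

y = quantize(xr)                 # coarse code
r = xr - y                       # quantization residual

# 3) 1-bit QJL on the residual: only sign(G @ r) is stored. At query time,
#    sqrt(pi/2) * ||r|| * mean(sign(G r) * (G q)) is unbiased for <r, q>.
G = rng.standard_normal((m, d))
signs = np.sign(G @ r)           # the 1-bit residual code
res_est = np.sqrt(np.pi / 2) * np.linalg.norm(r) * np.mean(signs * (G @ qr))

est = y @ qr + res_est           # coarse inner product + residual correction
print(f"true <x,q> = {x @ q:.3f}, estimate = {est:.3f}")
```

Note that the quantization step touches only `x` itself, which is what makes the scheme data-oblivious and online: no other vector in the collection is consulted.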

Quick Start & Requirements

  • Installation:
    • Core library: pip install pyturboquant
    • With LangChain: pip install pyturboquant[langchain]
    • Development: pip install pyturboquant[dev]
    • All features: pip install pyturboquant[all]
  • Prerequisites: Python >= 3.12, PyTorch >= 2.4.
  • Documentation: Code examples in the README serve as a quick start.

Highlighted Details

  • MSE-Optimal Quantizer (Algorithm 1) and Inner Product Quantizer (Algorithm 2).
  • Zero-indexing-time ANN search via TurboQuantIndex with a FAISS-like API.
  • Bounded search-time memory, configurable via search_batch_size.
  • LangChain TurboQuantVectorStore for low-RAM RAG pipelines.
  • Pure PyTorch implementation, supporting CPU and CUDA.
  • Deterministic rotations and QJL projections for reproducibility.
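To make the "zero indexing time" and "bounded search-time memory" bullets concrete, here is a minimal brute-force index sketch. The class name `TinyQuantIndex` and its methods are hypothetical and do not mirror `TurboQuantIndex`'s actual API; only the `search_batch_size` idea is taken from the list above.

```python
import numpy as np

class TinyQuantIndex:
    """Illustrative brute-force index over 8-bit quantized vectors.

    Zero indexing time: add() only quantizes and appends. search() scans
    the stored codes in fixed-size batches, so peak working memory is
    bounded by the batch size rather than the collection size.
    """

    def __init__(self, dim, search_batch_size=1024):
        self.dim = dim
        self.search_batch_size = search_batch_size
        self.codes = []    # one int8 code row per vector
        self.scales = []   # one scale factor per vector

    def add(self, vecs):
        for v in np.atleast_2d(vecs):
            scale = float(np.abs(v).max()) / 127 or 1.0
            self.codes.append(np.round(v / scale).astype(np.int8))
            self.scales.append(scale)

    def search(self, query, k=5):
        codes = np.stack(self.codes)
        scales = np.asarray(self.scales)
        scores = np.empty(len(codes))
        for start in range(0, len(codes), self.search_batch_size):
            stop = start + self.search_batch_size
            # Dequantize only this batch: bounded working memory.
            approx = codes[start:stop].astype(np.float32) * scales[start:stop, None]
            scores[start:stop] = approx @ query
        top = np.argsort(-scores)[:k]
        return top, scores[top]

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 32))
query = rng.standard_normal(32)

index = TinyQuantIndex(dim=32, search_batch_size=128)
index.add(data)
top, top_scores = index.search(query, k=5)
```

The scan is O(n) per query, matching the complexity noted under Limitations & Caveats; sub-linear search would require an additional pruning structure on top of the codes.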

Maintenance & Community

The project is marked as Work In Progress (WIP) with a roadmap indicating future enhancements like LlamaIndex integration and sub-linear search. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.

Licensing & Compatibility

The project is released under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

This is a Work In Progress (WIP) implementation. The current search compute complexity is O(n) per query, with sub-linear search planned for v0.5.0. The library focuses on compressing embedding vectors, not the embedding models themselves, meaning VRAM requirements for running models remain unchanged.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 399 stars in the last 30 days
