pyturboquant by jorgebmann

GPU-accelerated online vector quantization for RAG and ANN

Created 1 month ago
399 stars

Top 72.1% on SourcePulse

Project Summary

This library provides a Python implementation of Google's TurboQuant framework, enabling data-oblivious online vector quantization for embedding storage and approximate nearest neighbor (ANN) search. It offers significant memory compression for RAG pipelines and large-scale retrieval without codebook training or multiple passes over the data, making it well suited to on-premise deployments and resource-constrained environments.

How It Works

The core approach applies a random orthogonal rotation to each vector, followed by per-coordinate scalar quantization using precomputed Lloyd-Max codebooks (the MSE-Optimal Quantizer). For inner product estimation, a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied to the quantization residual, yielding an unbiased estimator. Because each vector is quantized independently, ingestion is truly online and requires no costly indexing or retraining step.
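The pipeline above can be sketched in plain NumPy. This is an illustration, not the library's code: the uniform quantizer below is a stand-in for TurboQuant's precomputed Lloyd-Max codebooks, and the projection count `m` is chosen only to make the estimate visibly converge.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: deterministic rotation and QJL
d, m = 16, 20000                 # vector dimension, number of QJL projections

# 1) Random orthogonal rotation: QR of a Gaussian matrix, sign-corrected
#    so the result is Haar-distributed.
Q_, R_ = np.linalg.qr(rng.standard_normal((d, d)))
rot = Q_ * np.sign(np.diag(R_))

# 2) Per-coordinate scalar quantization of the rotated vector
#    (uniform quantizer as a stand-in for Lloyd-Max codebooks).
def quantize(y, bits=3):
    levels = 2 ** (bits - 1)
    scale = np.abs(y).max() / levels
    return np.clip(np.round(y / scale), -levels, levels - 1) * scale

x = rng.standard_normal(d)       # vector to store
q = rng.standard_normal(d)       # query
xr, qr = rot @ x, rot @ q        # an orthogonal rotation preserves <x, q>

y = quantize(xr)                 # coarse code
r = xr - y                       # quantization residual

# 3) 1-bit QJL on the residual: only sign(G @ r) is stored. At query time,
#    sqrt(pi/2) * ||r|| * mean(sign(G r) * (G q)) is unbiased for <r, q>.
G = rng.standard_normal((m, d))
signs = np.sign(G @ r)           # the 1-bit residual code
res_est = np.sqrt(np.pi / 2) * np.linalg.norm(r) * np.mean(signs * (G @ qr))

est = y @ qr + res_est           # coarse inner product + residual correction
print(f"true <x,q> = {x @ q:.3f}, estimate = {est:.3f}")
```

Note that the quantization step touches only `x` itself, which is what makes the scheme data-oblivious and online: no other vector in the collection is consulted.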

Quick Start & Requirements

  • Installation:
    • Core library: pip install pyturboquant
    • With LangChain: pip install pyturboquant[langchain]
    • Development: pip install pyturboquant[dev]
    • All features: pip install pyturboquant[all]
  • Prerequisites: Python >= 3.12, PyTorch >= 2.4.
  • Documentation: Code examples in the README serve as a quick start.

Highlighted Details

  • MSE-Optimal Quantizer (Algorithm 1) and Inner Product Quantizer (Algorithm 2).
  • Zero-indexing-time ANN search via TurboQuantIndex with a FAISS-like API.
  • Bounded search-time memory, configurable via search_batch_size.
  • LangChain TurboQuantVectorStore for low-RAM RAG pipelines.
  • Pure PyTorch implementation, supporting CPU and CUDA.
  • Deterministic rotations and QJL projections for reproducibility.
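To make the "zero indexing time" and "bounded search-time memory" bullets concrete, here is a minimal brute-force index sketch. The class name `TinyQuantIndex` and its methods are hypothetical and do not mirror `TurboQuantIndex`'s actual API; only the `search_batch_size` idea is taken from the list above.

```python
import numpy as np

class TinyQuantIndex:
    """Illustrative brute-force index over 8-bit quantized vectors.

    Zero indexing time: add() only quantizes and appends. search() scans
    the stored codes in fixed-size batches, so peak working memory is
    bounded by the batch size rather than the collection size.
    """

    def __init__(self, dim, search_batch_size=1024):
        self.dim = dim
        self.search_batch_size = search_batch_size
        self.codes = []    # one int8 code row per vector
        self.scales = []   # one scale factor per vector

    def add(self, vecs):
        for v in np.atleast_2d(vecs):
            scale = float(np.abs(v).max()) / 127 or 1.0
            self.codes.append(np.round(v / scale).astype(np.int8))
            self.scales.append(scale)

    def search(self, query, k=5):
        codes = np.stack(self.codes)
        scales = np.asarray(self.scales)
        scores = np.empty(len(codes))
        for start in range(0, len(codes), self.search_batch_size):
            stop = start + self.search_batch_size
            # Dequantize only this batch: bounded working memory.
            approx = codes[start:stop].astype(np.float32) * scales[start:stop, None]
            scores[start:stop] = approx @ query
        top = np.argsort(-scores)[:k]
        return top, scores[top]

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 32))
query = rng.standard_normal(32)

index = TinyQuantIndex(dim=32, search_batch_size=128)
index.add(data)
top, top_scores = index.search(query, k=5)
```

The scan is O(n) per query, matching the complexity noted under Limitations & Caveats; sub-linear search would require an additional pruning structure on top of the codes.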

Maintenance & Community

The project is marked as Work In Progress (WIP) with a roadmap indicating future enhancements like LlamaIndex integration and sub-linear search. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.

Licensing & Compatibility

The project is released under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

This is a Work In Progress (WIP) implementation. The current search compute complexity is O(n) per query, with sub-linear search planned for v0.5.0. The library focuses on compressing embedding vectors, not the embedding models themselves, meaning VRAM requirements for running models remain unchanged.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 399 stars in the last 30 days
