tonbistudio: LLM KV cache compression for extended context
This repository provides a from-scratch PyTorch implementation of Google's TurboQuant algorithm, designed to address the significant memory bottleneck posed by Key-Value (KV) caches in Large Language Models (LLMs). It targets LLM developers and researchers seeking to reduce memory footprints, enabling longer context windows or deployment on resource-constrained hardware. The primary benefit is achieving substantial KV cache compression (up to 7.3x) with minimal degradation in attention fidelity.
How It Works
TurboQuant employs a two-stage vector quantization approach. Stage 1 involves multiplying each vector by a random orthogonal matrix, which transforms its coordinates into a predictable distribution (approximating a standard normal distribution). This allows for optimal scalar quantization of each coordinate independently using the Lloyd-Max algorithm, minimizing Mean Squared Error (MSE) and precomputing codebooks. Stage 2 introduces Quantized Johnson-Lindenstrauss (QJL) residual correction, using just 1 bit per dimension to encode the error from Stage 1. This correction mathematically unbiases the dot product (attention score) estimation, preserving attention accuracy even though individual vectors are heavily distorted.
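The two-stage idea can be illustrated with a minimal NumPy sketch. This is not the repository's code: the 2-bit codebook below uses the classic Lloyd-Max reconstruction levels for a standard normal variable, and the residual step is a simplified sign-plus-scale stand-in for the paper's QJL correction, just to show why 1 extra bit per dimension shrinks the error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Stage 1: random orthogonal rotation. After rotating, the coordinates of a
# fixed-norm vector look approximately i.i.d. Gaussian, so one scalar
# codebook fits every coordinate.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Classic 2-bit Lloyd-Max reconstruction levels for N(0, 1).
codebook = np.array([-1.510, -0.4528, 0.4528, 1.510])

x = rng.standard_normal(d)          # a key/value vector to compress
z = Q @ x                           # rotate
scale = np.linalg.norm(z) / np.sqrt(d)  # per-vector normalization

# Nearest-level scalar quantization of each coordinate independently.
idx = np.abs(z[:, None] / scale - codebook).argmin(axis=1)
z1 = scale * codebook[idx]          # Stage 1 reconstruction
residual = z - z1

# Stage 2 (simplified): 1 bit/dim stores the residual's sign; one scalar
# (the mean absolute residual) stores its typical magnitude.
signs = np.sign(residual)
r_scale = np.abs(residual).mean()
z2 = z1 + signs * r_scale           # residual-corrected reconstruction

# Undo the rotation and compare relative errors.
err_stage1 = np.linalg.norm(Q.T @ z1 - x) / np.linalg.norm(x)
err_both = np.linalg.norm(Q.T @ z2 - x) / np.linalg.norm(x)
print(f"relative error, stage 1 only:     {err_stage1:.3f}")
print(f"relative error, with residual bit: {err_both:.3f}")
```

Because the mean absolute residual minimizes the squared error among all single-scalar sign corrections, the Stage 2 reconstruction is always at least as accurate as Stage 1 alone; the actual QJL correction goes further by making the *dot-product* estimate unbiased.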
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt`; scipy is needed for codebook computation.
- `transformers`, `accelerate`, and `bitsandbytes` are required only for real-model validation.
- The `validate.py` script uses approximately 2GB of VRAM to load the Qwen2.5-3B-Instruct model.
- Run `test_turboquant.py` for synthetic validation and `validate.py` for real-model validation.

Highlighted Details
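Putting those steps together, a typical session might look like this (assuming the scripts sit at the repository root):

```shell
# install dependencies (scipy is needed for codebook computation)
pip install -r requirements.txt

# synthetic validation: no GPU or model download required
python test_turboquant.py

# real-model validation: downloads Qwen2.5-3B-Instruct, needs a CUDA GPU
python validate.py
```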
Maintenance & Community
No specific details regarding notable contributors, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README.
Licensing & Compatibility
Limitations & Caveats
Directly decompressing vectors and feeding them to a standard attention mechanism produces unusable model output, because the algorithm prioritizes accurate attention scores over vector fidelity. Real-model validation requires a CUDA GPU with a minimum of 6GB of VRAM. The implementation is based on a paper published at ICLR 2026, so it represents cutting-edge, recently published research.