tonbistudio: LLM KV cache compression for extended context
This repository provides a from-scratch PyTorch implementation of Google's TurboQuant algorithm, designed to address the significant memory bottleneck posed by Key-Value (KV) caches in Large Language Models (LLMs). It targets LLM developers and researchers seeking to reduce memory footprints, enabling longer context windows or deployment on resource-constrained hardware. The primary benefit is achieving substantial KV cache compression (up to 7.3x) with minimal degradation in attention fidelity.
How It Works
TurboQuant employs a two-stage vector quantization approach. Stage 1 involves multiplying each vector by a random orthogonal matrix, which transforms its coordinates into a predictable distribution (approximating a standard normal distribution). This allows for optimal scalar quantization of each coordinate independently using the Lloyd-Max algorithm, minimizing Mean Squared Error (MSE) and precomputing codebooks. Stage 2 introduces Quantized Johnson-Lindenstrauss (QJL) residual correction, using just 1 bit per dimension to encode the error from Stage 1. This correction mathematically unbiases the dot product (attention score) estimation, preserving attention accuracy even though individual vectors are heavily distorted.
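The two-stage idea can be illustrated with a minimal NumPy sketch. This is not the repository's code: the 2-bit codebook below uses the classic Lloyd-Max reconstruction levels for a standard normal variable, and the residual step is a simplified sign-plus-scale stand-in for the paper's QJL correction, just to show why 1 extra bit per dimension shrinks the error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Stage 1: random orthogonal rotation. After rotating, the coordinates of a
# fixed-norm vector look approximately i.i.d. Gaussian, so one scalar
# codebook fits every coordinate.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Classic 2-bit Lloyd-Max reconstruction levels for N(0, 1).
codebook = np.array([-1.510, -0.4528, 0.4528, 1.510])

x = rng.standard_normal(d)          # a key/value vector to compress
z = Q @ x                           # rotate
scale = np.linalg.norm(z) / np.sqrt(d)  # per-vector normalization

# Nearest-level scalar quantization of each coordinate independently.
idx = np.abs(z[:, None] / scale - codebook).argmin(axis=1)
z1 = scale * codebook[idx]          # Stage 1 reconstruction
residual = z - z1

# Stage 2 (simplified): 1 bit/dim stores the residual's sign; one scalar
# (the mean absolute residual) stores its typical magnitude.
signs = np.sign(residual)
r_scale = np.abs(residual).mean()
z2 = z1 + signs * r_scale           # residual-corrected reconstruction

# Undo the rotation and compare relative errors.
err_stage1 = np.linalg.norm(Q.T @ z1 - x) / np.linalg.norm(x)
err_both = np.linalg.norm(Q.T @ z2 - x) / np.linalg.norm(x)
print(f"relative error, stage 1 only:     {err_stage1:.3f}")
print(f"relative error, with residual bit: {err_both:.3f}")
```

Because the mean absolute residual minimizes the squared error among all single-scalar sign corrections, the Stage 2 reconstruction is always at least as accurate as Stage 1 alone; the actual QJL correction goes further by making the *dot-product* estimate unbiased.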
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt`; scipy is needed for codebook computation.
- `transformers`, `accelerate`, and `bitsandbytes` are required only for real-model validation.
- The `validate.py` script uses approximately 2GB of VRAM to load the Qwen2.5-3B-Instruct model.
- Run `test_turboquant.py` for synthetic validation and `validate.py` for real-model validation.

Highlighted Details
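Putting those steps together, a typical session might look like this (assuming the scripts sit at the repository root):

```shell
# install dependencies (scipy is needed for codebook computation)
pip install -r requirements.txt

# synthetic validation: no GPU or model download required
python test_turboquant.py

# real-model validation: downloads Qwen2.5-3B-Instruct, needs a CUDA GPU
python validate.py
```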
Maintenance & Community
No specific details regarding notable contributors, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README.
Licensing & Compatibility
Limitations & Caveats
Directly decompressing vectors and feeding them to a standard attention mechanism produces unusable model output, because the algorithm prioritizes accurate attention scores over vector fidelity. Real-model validation requires a CUDA GPU with a minimum of 6GB of VRAM. The implementation is based on a paper published at ICLR 2026, so it represents cutting-edge, recently published research.