turboquant-pytorch by tonbistudio

LLM KV cache compression for extended context

Created 2 weeks ago


889 stars

Top 40.5% on SourcePulse

Project Summary
Project Summary

This repository provides a from-scratch PyTorch implementation of Google's TurboQuant algorithm, designed to address the significant memory bottleneck posed by Key-Value (KV) caches in Large Language Models (LLMs). It targets LLM developers and researchers seeking to reduce memory footprints, enabling longer context windows or deployment on resource-constrained hardware. The primary benefit is achieving substantial KV cache compression (up to 7.3x) with minimal degradation in attention fidelity.

How It Works

TurboQuant employs a two-stage vector quantization approach. Stage 1 involves multiplying each vector by a random orthogonal matrix, which transforms its coordinates into a predictable distribution (approximating a standard normal distribution). This allows for optimal scalar quantization of each coordinate independently using the Lloyd-Max algorithm, minimizing Mean Squared Error (MSE) and precomputing codebooks. Stage 2 introduces Quantized Johnson-Lindenstrauss (QJL) residual correction, using just 1 bit per dimension to encode the error from Stage 1. This correction mathematically unbiases the dot product (attention score) estimation, preserving attention accuracy even though individual vectors are heavily distorted.
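The two stages can be illustrated with a minimal NumPy sketch. This is not the repository's actual API: the function names are invented, the fixed 3-bit codebook stands in for a true Lloyd-Max fit, and a simple sign-magnitude correction stands in for the paper's QJL residual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Stage 1: a random orthogonal rotation makes each coordinate approximately
# Gaussian, so one precomputed scalar codebook fits every dimension.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Stand-in 3-bit (8-level) codebook for N(0, 1); the real method derives
# MSE-optimal levels with the Lloyd-Max algorithm.
levels = np.array([-1.75, -1.05, -0.6, -0.2, 0.2, 0.6, 1.05, 1.75])

def quantize(x):
    z = Q @ x                                # rotate into near-Gaussian basis
    s = z.std() + 1e-12                      # per-vector scale
    idx = np.abs(z[:, None] / s - levels[None, :]).argmin(axis=1)
    zq = s * levels[idx]                     # Stage 1 reconstruction (3 bits/dim)
    r = z - zq                               # Stage 1 quantization error
    sgn = np.where(r >= 0, 1.0, -1.0)        # Stage 2: 1 bit/dim residual sign
    c = np.abs(r).mean()                     # one scalar magnitude for the signs
    return zq, sgn, c, r

x = rng.standard_normal(d)                   # a "key" vector
y = rng.standard_normal(d)                   # a "query" vector

zq, sgn, c, r = quantize(x)
zy = Q @ y                                   # rotate the query into the same basis

true_dot = x @ y                             # orthogonal rotation preserves dots
est_stage1 = zq @ zy                         # low-fidelity Stage 1 estimate
est_corrected = est_stage1 + c * (sgn @ zy)  # residual-corrected estimate
```

The key property the sketch shares with TurboQuant: the residual correction improves dot-product estimates even though the stored vector stays heavily distorted, since `c * sgn` is the best 1-bit-per-dimension approximation of the Stage 1 error.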

Quick Start & Requirements

  • Primary install command: pip install -r requirements.txt
  • Non-default prerequisites: Python 3.10+, PyTorch 2.0+ with CUDA (for GPU tests), scipy (for codebook computation). The transformers, accelerate, and bitsandbytes packages are required only for real model validation.
  • Resource footprint: Real model validation requires a CUDA GPU with at least 6GB VRAM. The example validate.py script uses approximately 2GB VRAM to load the Qwen2.5-3B-Instruct model.
  • Relevant pages: test_turboquant.py for synthetic validation, validate.py for real model validation.

Highlighted Details

  • Achieves 5x KV cache compression at 3-bit quantization with 99.5% attention fidelity (Cosine Similarity).
  • Real model validation on Qwen2.5-3B-Instruct demonstrates compression ratios of 3.8x (4-bit), 5.0x (3-bit), and 7.3x (2-bit) compared to the FP16 baseline.
  • At 3-bit compression, attention scores maintain ~0.9945 Cosine Similarity, 86% Top-1 Match, and 94% Top-5 Match across 8K context.
  • Synthetic vector tests confirm near-zero bias and high correlation (0.93 at 3-bit) for inner product estimation with QJL correction.
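The reported compression ratios are consistent with a small per-dimension metadata overhead on top of the payload bits. A purely illustrative back-of-envelope check, where the ~0.2 bits/dim overhead figure is an assumption and not from the repository:

```python
# Assumed accounting: FP16 baseline vs. b payload bits plus ~0.2 bits/dim
# of scale/codebook metadata (an assumption, not the repo's actual math).
fp16_bits = 16.0
overhead_bits = 0.2

ratios = {bits: fp16_bits / (bits + overhead_bits) for bits in (4, 3, 2)}
for bits, reported in [(4, 3.8), (3, 5.0), (2, 7.3)]:
    print(f"{bits}-bit: {ratios[bits]:.1f}x (reported: {reported}x)")
```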

Maintenance & Community

No specific details regarding notable contributors, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

  • License type: MIT.
  • Compatibility notes: No explicit restrictions for commercial use or closed-source linking are mentioned.

Limitations & Caveats

Decompressing vectors and feeding them to a standard attention mechanism produces unusable model output: the algorithm preserves attention scores, not vector fidelity. Real model validation requires a CUDA GPU with at least 6GB VRAM. The implementation follows a paper published at ICLR 2026, so it represents cutting-edge, recently published research.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 14
  • Star History: 904 stars in the last 17 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl

0.6% · 501 stars
Code for a NeurIPS 2024 research paper on LLM quantization
Created 2 years ago · Updated 1 year ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 41 more.

unsloth by unslothai

2.6% · 61k stars
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 2 years ago · Updated 1 day ago