R-KV by Zefan-Cai

KV cache compression for reasoning models

created 2 months ago
1,099 stars

Top 34.6% on SourcePulse

Project Summary

R-KV addresses the significant memory overhead of KV caches in Large Language Models (LLMs) during long-form reasoning tasks such as chain-of-thought (CoT) generation. It targets researchers and engineers working with reasoning-focused LLMs, offering substantial memory savings and throughput gains with minimal accuracy loss.

How It Works

R-KV employs a novel redundancy-aware KV cache compression strategy during decoding. It scores newly generated tokens based on both their importance (derived from attention weights) and their non-redundancy (using cosine similarity to identify and prune near-duplicates). A joint selection mechanism then retains the top-k tokens within a budget, balancing memory savings against accuracy. This approach is advantageous as it specifically targets the redundancy inherent in reasoning traces, unlike methods optimized for prompt compression.
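
A minimal sketch of how such a joint score might be computed, assuming per-token key vectors and attention-derived importance scores are available. The function name, the mixing weight alpha, and the use of max-similarity as the redundancy measure are illustrative assumptions, not the official R-KV implementation:

```python
# Sketch of redundancy-aware KV selection, NOT the official R-KV code.
# Shapes, `alpha`, and the redundancy measure are assumptions.
import torch

def select_kv_budget(keys, attn_weights, budget, alpha=0.5):
    """Pick `budget` token indices balancing importance and non-redundancy.

    keys:         (seq_len, head_dim)  key vectors of generated tokens
    attn_weights: (seq_len,)           importance scores, e.g. attention
                                       mass each token receives from
                                       recent queries
    """
    seq_len = keys.size(0)
    if seq_len <= budget:
        return torch.arange(seq_len)

    # Cosine similarity between all pairs of key vectors.
    normed = torch.nn.functional.normalize(keys, dim=-1)
    sim = normed @ normed.T                     # (seq_len, seq_len)
    sim.fill_diagonal_(0.0)

    # A token's redundancy = its highest similarity to any other token;
    # near-duplicate steps in a reasoning trace score close to 1.
    redundancy = sim.max(dim=-1).values         # (seq_len,)

    # Joint score: tokens that are important AND non-redundant rank highest.
    importance = attn_weights / attn_weights.sum()
    score = alpha * importance + (1 - alpha) * (1 - redundancy)

    return score.topk(budget).indices.sort().values
```

A greedy joint score like this is the simplest reading of the description; an iterative scheme that re-scores redundancy after each pick would be costlier but would avoid discarding both members of a duplicate pair.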

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Install FlashAttention (recommended) and enable it when loading the model with attn_implementation="flash_attention_2" (see the loading sketch after this list).
  • Build R-KV package: pip install -e .
  • Run example: bash examples/run.sh or python3 ./run_math.py ...
  • Evaluation toolkit setup: cd evaluation/latex2sympy && pip install -e . && cd .. && pip install -r requirements.txt
  • Requires Python and CUDA.
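
For the FlashAttention step above, a minimal loading sketch using the standard transformers API; the model ID is one of the DeepSeek-R1 distilled variants the README targets, chosen here as an example, and the R-KV compression hook-up itself is omitted:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any DeepSeek-R1 distilled variant works the same way here.
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",  # requires flash-attn installed
    torch_dtype=torch.float16,
    device_map="auto",
)
```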

Highlighted Details

  • Retains ≈100% of full-cache accuracy while keeping only 10% of the KV cache.
  • Demonstrates up to a 6.6x throughput increase and 90% memory savings on long CoT generation (a back-of-envelope cache-size estimate follows this list).
  • Outperforms baselines like SnapKV by retaining more diverse and important tokens.
  • Offers a plug-and-play, training-free solution for existing autoregressive LLMs.
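
To put the 90% figure in context, a back-of-envelope estimate of KV cache size; the model configuration (a typical 8B GQA model in fp16) and the sequence length are assumptions for illustration, not numbers from the README:

```python
# Assumed config: a typical 8B GQA model in fp16 (not from the README).
num_layers, num_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                  # fp16
seq_len = 16_384                     # a long chain-of-thought trace

# Both K and V are cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
full_cache_gib = seq_len * bytes_per_token / 2**30

print(f"full KV cache: {full_cache_gib:.1f} GiB per sequence")        # ~2.0 GiB
print(f"10% budget:    {0.1 * full_cache_gib:.1f} GiB per sequence")  # ~0.2 GiB
```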

Maintenance & Community

The project was released on May 25, 2025. The README provides no community links (Discord/Slack) or contributor-activity information.

Licensing & Compatibility

The README does not state a license; in the absence of one, default copyright applies. The citation lists authors from multiple institutions. Compatibility with commercial use or closed-source linking is therefore not established.

Limitations & Caveats

The README does not list specific limitations, unsupported platforms, or known bugs. The project is a recent release, evaluated primarily on math reasoning benchmarks (MATH-500, AIME-24) with DeepSeek-R1 variants.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 20 stars in the last 30 days
