KV cache compression for reasoning models
Top 34.6% on SourcePulse
R-KV addresses the significant memory overhead of KV caches in Large Language Models (LLMs) during long-form reasoning tasks like Chain-of-Thought. It targets researchers and engineers working with reasoning-focused LLMs, offering substantial memory savings and throughput improvements with minimal accuracy loss.
How It Works
R-KV employs a novel redundancy-aware KV cache compression strategy during decoding. It scores newly generated tokens based on both their importance (derived from attention weights) and their non-redundancy (using cosine similarity to identify and prune near-duplicates). A joint selection mechanism then retains the top-k tokens within a budget, balancing memory savings against accuracy. This approach is advantageous as it specifically targets the redundancy inherent in reasoning traces, unlike methods optimized for prompt compression.
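To make the selection step concrete, below is a minimal PyTorch sketch of joint importance/redundancy scoring. It is an illustration of the idea described above, not the repository's implementation: the tensor shapes, the mixing weight alpha, the budget value, and the function name select_kv_budget are all assumptions, and R-KV's actual scoring details may differ.

# Illustrative sketch of redundancy-aware KV selection (not R-KV's real code).
import torch

def select_kv_budget(keys, attn_weights, budget, alpha=0.5):
    """Keep `budget` cached tokens, trading off attention importance
    against redundancy among key vectors.

    keys:         (seq_len, head_dim)  cached key vectors
    attn_weights: (seq_len,)           attention mass each cached token received
    """
    # Importance: normalized attention mass per cached token.
    importance = attn_weights / attn_weights.sum()

    # Redundancy: each token's maximum cosine similarity to any other
    # cached token; near-duplicates score close to 1.
    normed = torch.nn.functional.normalize(keys, dim=-1)
    sim = normed @ normed.T
    sim.fill_diagonal_(-1.0)           # ignore self-similarity
    redundancy = sim.max(dim=-1).values

    # Joint score: tokens that are important AND non-redundant rank highest.
    score = alpha * importance + (1 - alpha) * (1 - redundancy)
    keep = torch.topk(score, k=min(budget, keys.shape[0])).indices
    return torch.sort(keep).values     # preserve original token order

# Toy usage: 16 cached tokens, keep 8 within the budget.
keys = torch.randn(16, 64)
attn = torch.rand(16)
print(select_kv_budget(keys, attn, budget=8))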
Quick Start & Requirements
Install the dependencies and the package in editable mode:
pip install -r requirements.txt
pip install -e .
When loading the model in Python, enable FlashAttention 2:
model = AutoModelForCausalLM.from_pretrained("model_name_or_path", attn_implementation="flash_attention_2")
Run the provided example:
bash examples/run.sh
or invoke the math benchmark script directly:
python3 ./run_math.py ...
To set up the evaluation environment:
cd evaluation/latex2sympy && pip install -e . && cd .. && pip install -r requirements.txt
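For context, here is a minimal end-to-end loading-and-generation sketch using only the standard Hugging Face API referenced above; the model identifier, dtype, prompt, and generation settings are placeholder assumptions, and enabling R-KV's cache compression itself follows the repository's example scripts (examples/run.sh, run_math.py) rather than anything shown here.

# Hedged usage sketch: load a model with FlashAttention 2 and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model_name_or_path"  # placeholder, e.g. a DeepSeek-R1 distilled variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # as required in the Quick Start
    device_map="auto",
)

prompt = "Solve: what is 17 * 24? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))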
Highlighted Details
Maintenance & Community
The project was released on May 25, 2025. No community links (Discord/Slack) or active contributor information is provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. The citation lists authors from multiple academic institutions, which suggests a research-oriented release, but no licensing terms are given; suitability for commercial use or closed-source linking is therefore unspecified.
Limitations & Caveats
The README does not detail specific limitations, unsupported platforms, or known bugs. The project is a recent release whose evaluation focuses on math reasoning benchmarks (MATH-500, AIME-24) and DeepSeek-R1 variants.