Unified framework for KV cache compression in auto-regressive models
This project provides a unified framework for KV cache compression techniques, aiming to reduce memory usage and improve inference speed for auto-regressive models. It targets researchers and engineers working with large language models who need to handle longer contexts efficiently.
How It Works
The framework integrates various KV cache compression methods, including PyramidKV, SnapKV, H2O, and StreamingLLM. It leverages optimized attention implementations like Flash Attention v2 and Scaled Dot Product Attention (SDPA) to apply these compression strategies. This approach allows for dynamic KV cache management, reducing memory footprint without significant performance degradation.
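These methods differ in how they score tokens and allocate the retained budget (PyramidKV, for instance, varies the budget across layers), but they share a common eviction core: keep the most recent tokens plus the past entries that recent queries attend to most. Below is a minimal, framework-independent sketch of that idea in PyTorch; the function name and parameters (`compress_kv`, `budget`, `window`) are illustrative and not part of this repository's API.

```python
# Illustrative sketch only (not this framework's actual code): a SnapKV/H2O-style
# heuristic that keeps a small window of recent tokens plus the top-k older
# entries ranked by how much attention the recent queries pay them.
import torch

def compress_kv(keys, values, queries, budget=256, window=32):
    """keys/values: [heads, seq, dim]; queries: [heads, window, dim] (recent queries).
    Returns compressed keys/values with at most budget + window entries per head."""
    heads, seq, dim = keys.shape
    if seq <= budget + window:
        return keys, values  # nothing to evict

    past_k, recent_k = keys[:, :-window], keys[:, -window:]
    past_v, recent_v = values[:, :-window], values[:, -window:]

    # Score each past position by the attention mass the recent queries give it.
    attn = torch.softmax(queries @ past_k.transpose(-1, -2) / dim**0.5, dim=-1)
    scores = attn.sum(dim=1)                                        # [heads, seq - window]
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # keep original order

    idx = keep.unsqueeze(-1).expand(-1, -1, dim)
    kept_k = torch.gather(past_k, 1, idx)
    kept_v = torch.gather(past_v, 1, idx)
    return torch.cat([kept_k, recent_k], dim=1), torch.cat([kept_v, recent_v], dim=1)
```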
Quick Start & Requirements
```bash
git clone https://github.com/Zefan-Cai/PyramidKV.git
cd PyramidKV
pip install -r requirements.txt .
```
Requirements: transformers >= 4.41 and flash-attn >= 2.4.0.post1. CUDA is required for Flash Attention v2.
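Because compression hooks into the attention implementation, models should be loaded with Flash Attention 2 (or SDPA) enabled. The snippet below shows only the standard Hugging Face loading step for one of the supported models; it does not enable any compression by itself, and the repository's own scripts should be consulted for the actual entry points.

```python
# Standard Hugging Face loading step with Flash Attention 2 enabled.
# This alone does not apply any KV cache compression.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the supported models
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is unavailable
    device_map="auto",
)
```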
Maintenance & Community
The project has seen recent updates, including name changes and support for new models and techniques. Links to relevant papers are provided for citation.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README indicates ongoing development with several items in the TODO list, including support for Mixtral and batch inference. Some features might be experimental or require specific hardware configurations (e.g., Flash Attention v2 compatibility). Model support is currently limited to Llama-3 and Mistral-7B.