KVCache-Factory by Zefan-Cai

Unified framework for KV cache compression in auto-regressive models

created 1 year ago
1,219 stars

Top 32.9% on sourcepulse

Project Summary

This project provides a unified framework for KV cache compression techniques, aiming to reduce memory usage and improve inference speed for auto-regressive models. It targets researchers and engineers working with large language models who need to handle longer contexts efficiently.

How It Works

The framework integrates multiple KV cache compression methods, including PyramidKV, SnapKV, H2O, and StreamingLLM, behind a common interface. Compression is applied on top of optimized attention implementations such as Flash Attention v2 and Scaled Dot-Product Attention (SDPA), allowing the KV cache to be pruned dynamically during generation and its memory footprint reduced without significant degradation in output quality.
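
For intuition, here is a minimal, illustrative sketch (not the repository's code) of a StreamingLLM-style eviction policy: keep a few initial "attention sink" tokens plus a sliding window of the most recent tokens, and drop everything in between. The function name evict_kv, the default sizes, and the [batch, heads, seq_len, head_dim] layout are assumptions chosen for this example.

    # Illustrative only; not KVCache-Factory's implementation.
    import torch

    def evict_kv(keys: torch.Tensor, values: torch.Tensor,
                 n_sink: int = 4, window: int = 1020):
        """keys/values: [batch, heads, seq_len, head_dim]; returns compressed tensors."""
        seq_len = keys.size(2)
        if seq_len <= n_sink + window:
            return keys, values  # cache still fits the budget; nothing to evict
        keep = torch.cat([
            torch.arange(n_sink, device=keys.device),                     # attention-sink tokens
            torch.arange(seq_len - window, seq_len, device=keys.device),  # recent window
        ])
        return keys[:, :, keep, :], values[:, :, keep, :]

Score-based methods such as SnapKV, H2O, and PyramidKV replace the fixed window with token selection driven by observed attention weights (per layer or per head), but the cache surgery itself looks much the same.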

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies:
    git clone https://github.com/Zefan-Cai/PyramidKV.git
    cd PyramidKV
    pip install -r requirements.txt
    
  • Prerequisites: Python, transformers >= 4.41, flash-attn >= 2.4.0.post1. CUDA is required for Flash Attention v2.
  • Inference: Scripts are provided for LongBench and Needle-in-a-Haystack evaluations. Example commands in the README specify the compression method, model path, and attention implementation for each task; a hedged model-loading sketch follows this list.
  • Documentation: Inference scripts and visualization tools are available.
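
As a point of reference for the prerequisites above, the following sketch shows how a compatible model is typically loaded through transformers with the Flash Attention v2 backend. It is a generic transformers usage example, not code taken from the repository; the checkpoint name is a placeholder.

    # Minimal sketch, assuming a CUDA GPU with flash-attn installed; the model name
    # below is a placeholder, not a path taken from the repository.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is unavailable
        device_map="auto",
    )

    prompt = "KV cache compression matters for long contexts because"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, use_cache=True)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Note that this snippet only selects the attention backend; the actual compression is applied by the repository's inference scripts, which choose a method (e.g., PyramidKV or SnapKV) and a cache budget per run.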

Highlighted Details

  • Supports multiple KV cache compression methods: PyramidKV, SnapKV, H2O, and StreamingLLM.
  • Optimized for Flash Attention v2 and SDPA, with support for multi-GPU inference.
  • Includes evaluation scripts for LongBench and Needle-in-a-Haystack benchmarks.
  • Offers visualization tools for KV cache attention maps.

Maintenance & Community

The project has seen recent updates, including name changes and support for new models and techniques. Links to relevant papers are provided for citation.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README indicates ongoing development, with several items still on the TODO list, including Mixtral support and batch inference. Some features may be experimental or require specific hardware (e.g., a GPU supported by Flash Attention v2). Model support is currently limited to Llama-3 and Mistral-7B.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 185 stars in the last 90 days
