KVCache-Factory by Zefan-Cai

Unified framework for KV cache compression in auto-regressive models

created 1 year ago
1,219 stars

Top 32.9% on sourcepulse

Project Summary

This project provides a unified framework for KV cache compression techniques, aiming to reduce memory usage and improve inference speed for auto-regressive models. It targets researchers and engineers working with large language models who need to handle longer contexts efficiently.

How It Works

The framework integrates multiple KV cache compression methods, including PyramidKV, SnapKV, H2O, and StreamingLLM, behind a common interface. Compression is applied on top of optimized attention implementations such as Flash Attention v2 and Scaled Dot-Product Attention (SDPA), allowing the KV cache to be pruned dynamically during generation and its memory footprint reduced without significant degradation in output quality.
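
For intuition, here is a minimal, illustrative sketch (not the repository's code) of a StreamingLLM-style eviction policy: keep a few initial "attention sink" tokens plus a sliding window of the most recent tokens, and drop everything in between. The function name evict_kv, the default sizes, and the [batch, heads, seq_len, head_dim] layout are assumptions chosen for this example.

    # Illustrative only; not KVCache-Factory's implementation.
    import torch

    def evict_kv(keys: torch.Tensor, values: torch.Tensor,
                 n_sink: int = 4, window: int = 1020):
        """keys/values: [batch, heads, seq_len, head_dim]; returns compressed tensors."""
        seq_len = keys.size(2)
        if seq_len <= n_sink + window:
            return keys, values  # cache still fits the budget; nothing to evict
        keep = torch.cat([
            torch.arange(n_sink, device=keys.device),                     # attention-sink tokens
            torch.arange(seq_len - window, seq_len, device=keys.device),  # recent window
        ])
        return keys[:, :, keep, :], values[:, :, keep, :]

Score-based methods such as SnapKV, H2O, and PyramidKV replace the fixed window with token selection driven by observed attention weights (per layer or per head), but the cache surgery itself looks much the same.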

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies:
    git clone https://github.com/Zefan-Cai/PyramidKV.git
    cd PyramidKV
    pip install -r requirements.txt
    
  • Prerequisites: Python, transformers >= 4.41, flash-attn >= 2.4.0.post1. CUDA is required for Flash Attention v2.
  • Inference: Scripts are provided for LongBench and Needle-in-a-Haystack evaluations. Example commands in the README specify the compression method, model path, and attention implementation for each task; a hedged model-loading sketch follows this list.
  • Documentation: Inference scripts and visualization tools are available.
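
As a point of reference for the prerequisites above, the following sketch shows how a compatible model is typically loaded through transformers with the Flash Attention v2 backend. It is a generic transformers usage example, not code taken from the repository; the checkpoint name is a placeholder.

    # Minimal sketch, assuming a CUDA GPU with flash-attn installed; the model name
    # below is a placeholder, not a path taken from the repository.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is unavailable
        device_map="auto",
    )

    prompt = "KV cache compression matters for long contexts because"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, use_cache=True)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Note that this snippet only selects the attention backend; the actual compression is applied by the repository's inference scripts, which choose a method (e.g., PyramidKV or SnapKV) and a cache budget per run.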

Highlighted Details

  • Supports multiple KV cache compression methods: PyramidKV, SnapKV, H2O, and StreamingLLM.
  • Optimized for Flash Attention v2 and SDPA, with support for multi-GPU inference.
  • Includes evaluation scripts for LongBench and Needle-in-a-Haystack benchmarks.
  • Offers visualization tools for KV cache attention maps.

Maintenance & Community

The project has seen recent updates, including name changes and support for new models and techniques. Links to relevant papers are provided for citation.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README indicates ongoing development, with several items still on the TODO list, including Mixtral support and batch inference. Some features may be experimental or require specific hardware (e.g., a GPU supported by Flash Attention v2). Model support is currently limited to Llama-3 and Mistral-7B.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 185 stars in the last 90 days
