triattention by WeianMao

LLM long reasoning acceleration via KV cache compression

Created 1 week ago

396 stars

Top 72.8% on SourcePulse

Project Summary

TriAttention addresses the challenge of efficient long-context reasoning in large language models (LLMs) by introducing trigonometric KV cache compression. This technique significantly reduces memory requirements and boosts throughput, enabling LLMs to run on memory-constrained GPUs and facilitating local deployment via integrations like OpenClaw. It targets engineers and researchers seeking to overcome hardware limitations for demanding long-context tasks without sacrificing accuracy.

How It Works

The core innovation is trigonometric frequency-domain compression of the KV cache. Pre-RoPE Q/K vectors in long-reasoning models exhibit predictable, distance-dependent patterns. TriAttention exploits these patterns by scoring keys from their centers and norms, derived from trigonometric series, instead of relying on expensive query-dependent selection. The result is accurate KV cache compression with minimal computational overhead compared to traditional attention-score-based eviction methods.
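The key idea, query-independent scoring of cached keys followed by top-k retention, can be sketched as follows. This is a toy illustration, not the paper's algorithm: the |center| + norm scoring rule below is a hypothetical stand-in for TriAttention's trigonometric-series-derived scores.

```python
import numpy as np

def score_keys(keys: np.ndarray) -> np.ndarray:
    """Score cached keys from query-independent per-key statistics.

    keys: (seq_len, head_dim). The |center| + norm rule here is a
    hypothetical stand-in for the trigonometric-series scores.
    """
    centers = keys.mean(axis=-1)           # per-key "center"
    norms = np.linalg.norm(keys, axis=-1)  # per-key magnitude
    return np.abs(centers) + norms

def compress_kv(keys: np.ndarray, values: np.ndarray, budget: int):
    """Keep only the `budget` highest-scoring KV pairs, in token order."""
    idx = np.sort(np.argsort(score_keys(keys))[-budget:])
    return keys[idx], values[idx]

# Example: compress a 4096-token cache down to a 1024 budget
rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 128))
values = rng.standard_normal((4096, 128))
k_small, v_small = compress_kv(keys, values, budget=1024)
print(k_small.shape)  # (1024, 128)
```

Because the scores need no query, compression can happen once per cache rather than per decoding step, which is where the low overhead comes from.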

Quick Start & Requirements

  • Installation: Clone the repository, change into its directory, and install in editable mode with pip install -e . The flash-attn library is recommended and can be installed with pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10+ and flash-attn. Runtime requires precomputed frequency statistics (e.g., TRIATTN_RUNTIME_SPARSE_STATS_PATH).
  • Deployment: Integrates with OpenClaw via a vLLM server that exposes an OpenAI-compatible API.
  • Links: Paper: https://arxiv.org/abs/2604.04921, Project Page: https://weianmao.github.io/tri-attention-project-page/
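With the vLLM server running, any OpenAI-compatible client can drive it. A minimal sketch of the request body follows; the server URL and model name are placeholders, not taken from the repo:

```python
import json

# Chat-completion request for a vLLM OpenAI-compatible endpoint.
# URL and model name are placeholders; substitute your own deployment.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "your-served-model",
    "messages": [
        {"role": "user", "content": "Summarize the argument in two sentences."}
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
# Send with e.g. requests.post(url, data=body,
#                              headers={"Content-Type": "application/json"})
print(body)
```

Any tool that speaks the OpenAI chat-completions protocol (including OpenClaw, per the summary above) can point at the same endpoint.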

Highlighted Details

  • Achieves a 10.7x reduction in KV cache memory and a 2.5x throughput increase on long reasoning tasks (AIME25) without accuracy loss.
  • Enables local deployment on memory-constrained GPUs, such as a 24GB RTX 4090, through OpenClaw compatibility.
  • Demonstrates a 6.3x speedup on the MATH-500 benchmark with a 1024 KV budget.

Maintenance & Community

The project is associated with researchers from MIT and NVIDIA, including Song Han and Yukang Chen. The roadmap indicates planned integrations with SGLang and Ollama, alongside support for more model architectures.

Licensing & Compatibility

This project is licensed under the Apache License 2.0. This permissive license generally allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Prefix caching is currently incompatible with KV compression and must be disabled. To prevent Out-Of-Memory (OOM) errors, especially with large prefill chunks, a reduced batch token limit (e.g., --max-num-batched-tokens 1024) is recommended. The system requires precomputed frequency statistics for runtime operation.
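Expressed as engine settings, the caveats above amount to something like the following sketch; the option spellings are assumptions to confirm against your vLLM release:

```python
# Settings mirroring the caveats above. Option names are assumptions
# to verify against your vLLM version, not taken from the repo.
engine_args = {
    "enable_prefix_caching": False,   # prefix caching conflicts with KV compression
    "max_num_batched_tokens": 1024,   # smaller prefill chunks to avoid OOM
}
print(engine_args)
```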

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 401 stars in the last 8 days
