WeianMao: LLM long reasoning acceleration via KV cache compression
Top 72.8% on SourcePulse
Summary
TriAttention addresses the challenge of efficient long-context reasoning in large language models (LLMs) by introducing trigonometric KV cache compression. This technique significantly reduces memory requirements and boosts throughput, enabling LLMs to run on memory-constrained GPUs and facilitating local deployment via integrations like OpenClaw. It targets engineers and researchers seeking to overcome hardware limitations for demanding long-context tasks without sacrificing accuracy.
How It Works
The core innovation lies in trigonometric frequency-domain compression of the KV cache. Pre-RoPE Q/K vectors in long reasoning models exhibit predictable patterns related to distance preferences. TriAttention leverages these patterns by scoring keys based on their centers and norms, derived from trigonometric series, rather than relying on complex query selection. This approach achieves accurate KV cache compression with minimal computational overhead compared to traditional attention-based methods.
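The key-scoring idea can be sketched in miniature. The snippet below is an illustrative assumption, not the project's actual implementation: it projects each pre-RoPE key onto a small cosine series, combines a "center" statistic of the coefficients with the key's norm, and keeps only the top-scoring KV pairs, all without consulting any query.

```python
import math

def key_score(key, freqs):
    """Hypothetical query-free score: project a pre-RoPE key onto a
    small cosine series, then combine a 'center' statistic of the
    coefficients with the key's norm. Illustrative only."""
    d = len(key)
    # Cosine-series coefficients of the key vector at a few frequencies
    coeffs = [
        sum(k * math.cos(math.pi * f * i / d) for i, k in enumerate(key))
        for f in freqs
    ]
    center = sum(abs(c) for c in coeffs) / len(coeffs)  # "center" statistic
    norm = math.sqrt(sum(k * k for k in key))           # key norm
    return center * norm

def compress_kv(keys, values, keep):
    """Retain only the top-`keep` KV pairs by score, preserving order."""
    ranked = sorted(range(len(keys)),
                    key=lambda i: key_score(keys[i], (1, 2, 3)),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept]

keys = [[0.1, 0.9, -0.3, 0.5], [2.0, -1.0, 0.7, 0.2], [0.01, 0.02, 0.0, 0.01]]
vals = ["v0", "v1", "v2"]
k2, v2 = compress_kv(keys, vals, keep=2)  # near-zero key is dropped
```

Because the score depends only on precomputable per-key statistics, it avoids the per-query attention passes that conventional eviction heuristics require.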
Quick Start & Requirements
Install from source in editable mode with pip install -e . The flash-attn library is recommended and can be installed with pip install flash-attn --no-build-isolation. Runtime requires precomputed frequency statistics, supplied via the TRIATTN_RUNTIME_SPARSE_STATS_PATH environment variable.
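The steps above, collected as a copy-pasteable sketch (the statistics path is a placeholder; substitute your own):

```shell
# Editable install of TriAttention from a local checkout
pip install -e .

# Recommended: FlashAttention kernels
pip install flash-attn --no-build-isolation

# Point the runtime at the precomputed frequency statistics
# (placeholder path -- replace with your actual stats file)
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=/path/to/stats
```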
Maintenance & Community
The project is associated with researchers from MIT and NVIDIA, including Song Han and Yukang Chen. The roadmap indicates planned integrations with SGLang and Ollama, alongside support for more model architectures.
Licensing & Compatibility
This project is licensed under the Apache License 2.0. This permissive license generally allows for commercial use and integration into closed-source projects.
Limitations & Caveats
Prefix caching is currently incompatible with KV compression and must be disabled. To prevent Out-Of-Memory (OOM) errors, especially with large prefill chunks, a reduced batch token limit (e.g., --max-num-batched-tokens 1024) is recommended. The system requires precomputed frequency statistics for runtime operation.
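The --max-num-batched-tokens flag and the prefix-caching caveat suggest a vLLM-style serving stack; assuming that, a launch honoring both constraints might look like the sketch below. The entry point and the --no-enable-prefix-caching flag are vLLM conventions, not confirmed by this project.

```shell
# Hypothetical vLLM-style launch under the documented constraints
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=/path/to/stats  # required at runtime
python -m vllm.entrypoints.openai.api_server \
    --no-enable-prefix-caching \          # prefix caching incompatible with KV compression
    --max-num-batched-tokens 1024         # reduced limit to avoid prefill OOM
```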