A framework for accelerating long-context LLM inference via dynamic sparse attention.
MInference accelerates long-context Large Language Model (LLM) inference, particularly the pre-filling stage, by employing dynamic sparse attention. It targets researchers and developers working with LLMs that require processing extensive contexts, offering up to a 10x speedup on A100 GPUs while maintaining accuracy.
How It Works
MInference builds on the observation that attention patterns in long-context LLMs are dynamically sparse. Offline, it profiles each attention head and assigns it a sparse pattern; online, it approximates the concrete sparse indices for the current input and computes attention with optimized custom kernels. By computing only the attention scores that matter, it sharply reduces pre-filling cost. A simplified sketch of the online step follows.
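To make the idea concrete, here is a single-head sketch of dynamic sparse attention in plain PyTorch. The function name, probe size, and the use of a local window as a stand-in for MInference's slash-diagonal pattern are illustrative assumptions; the real implementation uses fused GPU kernels and never materializes the dense score matrix.

import torch

def dynamic_sparse_attention(q, k, v, last_q=64, top_cols=256, window=256):
    # q, k, v: (seq_len, head_dim) for a single attention head.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5

    # Online index estimation: measure attention mass using only the
    # last `last_q` queries as a cheap probe over all key positions.
    probe = torch.softmax((q[-last_q:] @ k.T) * scale, dim=-1)
    keep_cols = probe.sum(dim=0).topk(min(top_cols, seq_len)).indices

    # Sparse mask: causal, restricted to the kept "vertical" columns
    # plus a local window (a stand-in for the slash-diagonal pattern).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = (j <= i) & ((j > i - window) | torch.isin(j, keep_cols))

    # Dense masked attention here for clarity only; the point of the
    # method is that custom kernels skip the masked-out positions.
    scores = (q @ k.T) * scale
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 4k-token context, 64-dim head.
q, k, v = (torch.randn(4096, 64) for _ in range(3))
out = dynamic_sparse_attention(q, k, v)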
Quick Start & Requirements
pip install minference
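A minimal end-to-end usage sketch, following the patching pattern shown in the project README; the model name is just an example of a long-context model, and the exact API may have changed, so verify against the repository before relying on it.

from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

# Example long-context model; substitute any model MInference supports.
model_name = "gradientai/Llama-3-8B-Instruct-262k"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Patch the model's attention layers with MInference's dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

long_prompt = "<your long document here>"
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))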