MInference accelerates long-context Large Language Model (LLM) inference, particularly the pre-filling stage, by employing dynamic sparse attention. It targets researchers and developers working with LLMs that require processing extensive contexts, offering up to a 10x speedup on A100 GPUs while maintaining accuracy.
How It Works
MInference leverages the observation that attention in long-context LLMs is dynamically sparse. In an offline step, it determines which sparse pattern best fits each attention head; at inference time, it approximates the dynamic sparse indices for that pattern online and computes attention with optimized custom GPU kernels. By evaluating only the attention entries that matter, it cuts the pre-filling cost substantially while preserving accuracy.
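The sketch below is a minimal, self-contained illustration of this idea, not MInference's actual kernels or pattern taxonomy: it approximates block-level importance with mean-pooled queries and keys, keeps only the top-k key blocks per query block (the "dynamic sparse indices"), and computes exact attention on those blocks. The function and parameter names (`block_sparse_attention`, `block_size`, `top_k_blocks`) are hypothetical.

```python
# Simplified dynamic block-sparse attention for a single head (illustrative only).
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    # q, k, v: (seq_len, head_dim); seq_len assumed divisible by block_size.
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size

    # Cheap online approximation: pool queries/keys per block and score block pairs.
    q_blocks = q.view(n_blocks, block_size, head_dim).mean(dim=1)
    k_blocks = k.view(n_blocks, block_size, head_dim).mean(dim=1)
    block_scores = q_blocks @ k_blocks.T                     # (n_blocks, n_blocks)

    # Keep only the top-k key blocks per query block: the dynamic sparse indices.
    top_idx = block_scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    scale = head_dim ** -0.5
    for qi in range(n_blocks):
        q_slice = q[qi * block_size:(qi + 1) * block_size]
        # Gather only the selected key/value blocks for this query block.
        sel = top_idx[qi].tolist()
        k_sel = torch.cat([k[j * block_size:(j + 1) * block_size] for j in sel])
        v_sel = torch.cat([v[j * block_size:(j + 1) * block_size] for j in sel])
        attn = F.softmax(q_slice @ k_sel.T * scale, dim=-1)
        out[qi * block_size:(qi + 1) * block_size] = attn @ v_sel
    return out

# Tiny usage example
q, k, v = (torch.randn(256, 64) for _ in range(3))
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 64])
```

MInference's real kernels implement several head-specific patterns in Triton and estimate indices far more efficiently; this sketch only conveys the "approximate indices online, then compute sparse attention" structure.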
Quick Start & Requirements
- Install via pip: `pip install minference`
- Prerequisites: PyTorch, FlashAttention-2 (optional), Triton, Transformers (>= 4.46.0).
- Integrates with Hugging Face Transformers and vLLM (see the sketch after this list).
- An official Hugging Face demo is available.
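A hedged sketch of the Transformers integration follows: the `MInference(...)` patching call and its arguments are assumptions based on the project's documented patching pattern and should be verified against the installed version and the official README.

```python
# Hedged sketch: patching a Hugging Face model with MInference before pre-filling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # import path assumed

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # any supported long-context LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Apply the dynamic sparse attention patch (attention type and arguments assumed).
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Pre-fill a long prompt as usual; the patched attention handles the sparsity.
inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```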
Highlighted Details
- Achieves up to 10x speedup in pre-filling for million-token contexts on A100.
- Supports a wide range of long-context LLMs including LLaMA-3.1, Qwen2.5, and GLM-4.
- Offers various KV cache optimization methods (compression, retrieval, loading) beyond its core sparse attention.
- Includes SCBench for evaluating long-context methods and MMInference for multimodal LLMs.
Maintenance & Community
- Active development with contributions from Microsoft researchers.
- Accepted at NeurIPS'24 (spotlight), ICLR'25, and ICML'25.
- Related projects like SCBench and MMInference are also under active development.
- Contributor License Agreement (CLA) required for contributions.
Licensing & Compatibility
- The repository does not explicitly state a license in the provided README. This requires further investigation for commercial use or closed-source linking.
Limitations & Caveats
- The specific license is not mentioned, which could be a blocker for commercial adoption.
- While many models are supported out of the box, LLMs without a bundled configuration may require manual setup of their sparse attention patterns.