MInference by microsoft

Framework for long-context LLM inference speedup via sparse attention

created 1 year ago · 1,082 stars · Top 35.7% on sourcepulse

Project Summary

MInference accelerates long-context Large Language Model (LLM) inference, particularly the pre-filling stage, by employing dynamic sparse attention. It targets researchers and developers working with LLMs that require processing extensive contexts, offering up to a 10x speedup on A100 GPUs while maintaining accuracy.

How It Works

MInference leverages the observation that attention in long-context LLMs is highly sparse, and that the sparsity is dynamic: which entries matter depends on the input. Offline, it classifies each attention head into one of a few recurring sparse pattern types (A-shape, vertical-slash, or block-sparse). Online, it approximates each head's sparse indices on the fly and computes attention with optimized custom kernels, skipping irrelevant attention scores and yielding large pre-filling speedups.
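
To make the pattern concrete, here is a toy sketch (ours, not the project's code) of the vertical-slash idea: important "vertical" key columns are estimated from the most recent queries, combined with a local diagonal band, and attention is evaluated only inside that mask. The function name, parameter choices, and the dense masking shortcut are all illustrative; MInference's real Triton/CUDA kernels compute only the selected entries rather than masking a dense score matrix.

    import torch

    # Toy single-head, unbatched sketch of "vertical-slash" sparse attention.
    def vertical_slash_attention(q, k, v, top_k=64, last_q=64, window=128):
        n, d = q.shape
        scale = d ** -0.5

        # Online index approximation: score key columns using only the most
        # recent queries, then keep the top-k "vertical" columns.
        probe = (q[-last_q:] @ k.T) * scale              # (last_q, n)
        col_score = probe.softmax(dim=-1).sum(dim=0)     # (n,)
        vert_idx = col_score.topk(min(top_k, n)).indices

        # Sparse mask: selected vertical columns plus a local diagonal band
        # (the "slash"), restricted to the causal lower triangle.
        mask = torch.zeros(n, n, dtype=torch.bool)
        mask[:, vert_idx] = True
        band = torch.arange(n)[:, None] - torch.arange(n)[None, :]
        mask |= (band >= 0) & (band < window)
        mask &= band >= 0                                # enforce causality

        # Attention is evaluated only where the mask is set.
        scores = (q @ k.T) * scale
        scores = scores.masked_fill(~mask, float("-inf"))
        return scores.softmax(dim=-1) @ v

    q = k = v = torch.randn(1024, 64)
    out = vertical_slash_attention(q, k, v)   # (1024, 64)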

Quick Start & Requirements

  • Install via pip: pip install minference
  • Prerequisites: PyTorch, FlashAttention-2 (optional), Triton, Transformers (>= 4.46.0).
  • Supports integration with Hugging Face Transformers and vLLM (a usage sketch follows this list).
  • Official HF Demo available.
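
A minimal usage sketch, following the patching pattern shown in the repo's README (the model name is only an example; check the README for the current API and supported models):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from minference import MInference

    # Example long-context model; substitute any supported model name.
    model_name = "gradientai/Llama-3-8B-Instruct-262k"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    # Patch the model so pre-filling uses MInference's dynamic sparse attention.
    minference_patch = MInference("minference", model_name)
    model = minference_patch(model)

    # Generation then works as with any Transformers model.
    inputs = tokenizer("Summarize the following document: ...",
                       return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))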

Highlighted Details

  • Achieves up to 10x speedup in pre-filling for million-token contexts on A100.
  • Supports a wide range of long-context LLMs including LLaMA-3.1, Qwen2.5, and GLM-4.
  • Offers various KV cache optimization methods (compression, retrieval, loading) beyond its core sparse attention.
  • Includes SCBench for evaluating long-context methods and MMInference for multimodal LLMs.

Maintenance & Community

  • Active development with contributions from Microsoft researchers.
  • Accepted at NeurIPS'24 (spotlight), ICLR'25, and ICML'25.
  • Related projects like SCBench and MMInference are also under active development.
  • Contributor License Agreement (CLA) required for contributions.

Licensing & Compatibility

  • The provided README does not explicitly state a license; confirm the repository's license terms before commercial use or closed-source linking.

Limitations & Caveats

  • The specific license is not mentioned, which could be a blocker for commercial adoption.
  • While it supports many models, manual configuration for unsupported LLMs might be necessary.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 85 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

  • 0.1% · 3k stars
  • LongLoRA: Efficient fine-tuning for long-context LLMs
  • created 1 year ago · updated 11 months ago
  • Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

streaming-llm by mit-han-lab

  • 0.1% · 7k stars
  • Framework for efficient LLM streaming
  • created 1 year ago · updated 1 year ago