MInference by microsoft

Framework for long-context LLM inference speedup via sparse attention

created 1 year ago · 1,082 stars · Top 35.7% on sourcepulse

Project Summary

MInference accelerates long-context Large Language Model (LLM) inference, particularly the pre-filling stage, by employing dynamic sparse attention. It targets researchers and developers working with LLMs that require processing extensive contexts, offering up to a 10x speedup on A100 GPUs while maintaining accuracy.

How It Works

MInference leverages the observation that attention in long-context LLMs is highly sparse, and that the sparsity is dynamic: which entries matter depends on the input. Offline, it classifies each attention head into one of a few recurring sparse pattern types (A-shape, vertical-slash, or block-sparse). Online, it approximates each head's sparse indices on the fly and computes attention with optimized custom kernels, skipping irrelevant attention scores and yielding large pre-filling speedups.
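
To make the pattern concrete, here is a toy sketch (ours, not the project's code) of the vertical-slash idea: important "vertical" key columns are estimated from the most recent queries, combined with a local diagonal band, and attention is evaluated only inside that mask. The function name, parameter choices, and the dense masking shortcut are all illustrative; MInference's real Triton/CUDA kernels compute only the selected entries rather than masking a dense score matrix.

    import torch

    # Toy single-head, unbatched sketch of "vertical-slash" sparse attention.
    def vertical_slash_attention(q, k, v, top_k=64, last_q=64, window=128):
        n, d = q.shape
        scale = d ** -0.5

        # Online index approximation: score key columns using only the most
        # recent queries, then keep the top-k "vertical" columns.
        probe = (q[-last_q:] @ k.T) * scale              # (last_q, n)
        col_score = probe.softmax(dim=-1).sum(dim=0)     # (n,)
        vert_idx = col_score.topk(min(top_k, n)).indices

        # Sparse mask: selected vertical columns plus a local diagonal band
        # (the "slash"), restricted to the causal lower triangle.
        mask = torch.zeros(n, n, dtype=torch.bool)
        mask[:, vert_idx] = True
        band = torch.arange(n)[:, None] - torch.arange(n)[None, :]
        mask |= (band >= 0) & (band < window)
        mask &= band >= 0                                # enforce causality

        # Attention is evaluated only where the mask is set.
        scores = (q @ k.T) * scale
        scores = scores.masked_fill(~mask, float("-inf"))
        return scores.softmax(dim=-1) @ v

    q = k = v = torch.randn(1024, 64)
    out = vertical_slash_attention(q, k, v)   # (1024, 64)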

Quick Start & Requirements

  • Install via pip: pip install minference
  • Prerequisites: PyTorch, FlashAttention-2 (optional), Triton, Transformers (>= 4.46.0).
  • Supports integration with Hugging Face Transformers and vLLM (a usage sketch follows this list).
  • Official HF Demo available.
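
A minimal usage sketch, following the patching pattern shown in the repo's README (the model name is only an example; check the README for the current API and supported models):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from minference import MInference

    # Example long-context model; substitute any supported model name.
    model_name = "gradientai/Llama-3-8B-Instruct-262k"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    # Patch the model so pre-filling uses MInference's dynamic sparse attention.
    minference_patch = MInference("minference", model_name)
    model = minference_patch(model)

    # Generation then works as with any Transformers model.
    inputs = tokenizer("Summarize the following document: ...",
                       return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))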

Highlighted Details

  • Achieves up to 10x speedup in pre-filling for million-token contexts on A100.
  • Supports a wide range of long-context LLMs including LLaMA-3.1, Qwen2.5, and GLM-4.
  • Offers various KV cache optimization methods (compression, retrieval, loading) beyond its core sparse attention.
  • Includes SCBench for evaluating long-context methods and MMInference for multimodal LLMs.

Maintenance & Community

  • Active development with contributions from Microsoft researchers.
  • Accepted at NeurIPS'24 (spotlight), ICLR'25, and ICML'25.
  • Related projects like SCBench and MMInference are also under active development.
  • Contributor License Agreement (CLA) required for contributions.

Licensing & Compatibility

  • The provided README does not explicitly state a license; confirm the repository's license terms before commercial use or closed-source linking.

Limitations & Caveats

  • The specific license is not mentioned, which could be a blocker for commercial adoption.
  • While it supports many models, manual configuration for unsupported LLMs might be necessary.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 85 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

  • 0.1% · 3k stars
  • LongLoRA: Efficient fine-tuning for long-context LLMs
  • created 1 year ago · updated 11 months ago
  • Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

streaming-llm by mit-han-lab

  • 0.1% · 7k stars
  • Framework for efficient LLM streaming
  • created 1 year ago · updated 1 year ago