flashinfer by flashinfer-ai

Kernel library for LLM serving

created 2 years ago
3,455 stars

Top 14.3% on sourcepulse

Project Summary

FlashInfer is a high-performance kernel library and generator for Large Language Model (LLM) serving, targeting researchers and engineers building efficient inference systems. It provides optimized implementations of key LLM operations like attention and sampling, aiming to deliver state-of-the-art performance and memory efficiency.

How It Works

FlashInfer leverages custom CUDA kernels and a JIT compilation approach to offer highly optimized LLM operations. It features efficient sparse/dense attention kernels for both CUDA Cores and Tensor Cores, load-balanced scheduling for variable-length inputs, and memory-saving techniques like Cascade Attention and Head-Query fusion. The library also supports customizable attention variants and integrates with CUDAGraphs and torch.compile for reduced latency.
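The PyTorch-facing API exposes simple functional entry points for single requests (plus plan/run wrappers for batches). Below is a minimal sketch of a single-request decode call; the shapes follow the official docs, but verify the exact signature against your installed version.

    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim = 32, 8, 128  # grouped-query layout
    kv_len = 2048

    # One decode-step query plus the sequence's KV-cache, in fp16 on GPU.
    q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

    # Decode attention against the KV-cache; the kernel is JIT-compiled
    # and cached on first use.
    o = flashinfer.single_decode_with_kv_cache(q, k, v)
    print(o.shape)  # expected: [num_qo_heads, head_dim]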

Quick Start & Requirements

  • Installation: pip install flashinfer-python (prebuilt wheels are available for Linux with specific CUDA/PyTorch versions, e.g., pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6). Nightly builds and source installation are also supported; a quick smoke test follows this list.
  • Prerequisites: CUDA Toolkit and PyTorch. NVCC is required both for runtime JIT compilation and for building from source.
  • Resources: Requires a CUDA-enabled GPU.
  • Documentation: https://docs.flashinfer.ai/
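Once installed, a quick smoke test confirms a GPU is visible (a minimal sketch; it assumes the package exposes a __version__ attribute, as recent releases do):

    import torch
    import flashinfer

    # FlashInfer requires a CUDA-enabled GPU; fail early if none is visible.
    assert torch.cuda.is_available(), "No CUDA device found"
    print("flashinfer", flashinfer.__version__)
    print("device:", torch.cuda.get_device_name(0))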

Highlighted Details

  • Achieves up to 90% of dense kernel bandwidth for vector-sparse attention.
  • Supports efficient low-precision and fused-RoPE attention for KV-Cache compression.
  • Provides sorting-free GPU kernels for Top-P, Top-K, and Min-P sampling (a minimal example follows this list).
  • Offers PyTorch, TVM, and header-only C++ APIs.
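The sampling kernels called out above live under flashinfer.sampling. A minimal top-p sketch follows; recent releases take probabilities directly, while older ones also required a tensor of uniform random samples, so check the docs for your version.

    import torch
    import flashinfer

    batch_size, vocab_size = 4, 32000
    logits = torch.randn(batch_size, vocab_size, device="cuda")
    probs = torch.softmax(logits, dim=-1)

    # Sorting-free nucleus (top-p) sampling; returns one token id per row.
    samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
    print(samples.shape)  # expected: [batch_size]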

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Prebuilt wheels are primarily for Linux; other platforms may require building from source.
  • The license is not clearly stated, which may impact commercial adoption.
Health Check

  • Last commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 132
  • Issues (30d): 31

Star History

  • 706 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 11 hours ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
Top 0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI researcher at UC Berkeley), and 16 more.

flash-attention by Dao-AILab
Top 0.7% · 19k stars
Fast, memory-efficient attention implementation
created 3 years ago · updated 14 hours ago