flashinfer by flashinfer-ai

Kernel library for LLM serving

Created 2 years ago
4,564 stars

Top 10.7% on SourcePulse

View on GitHub
Project Summary

FlashInfer is a high-performance kernel library and generator for Large Language Model (LLM) serving, targeting researchers and engineers building efficient inference systems. It provides optimized implementations of key LLM operations like attention and sampling, aiming to deliver state-of-the-art performance and memory efficiency.

How It Works

FlashInfer leverages custom CUDA kernels and a JIT compilation approach to offer highly optimized LLM operations. It features efficient sparse/dense attention kernels for both CUDA Cores and Tensor Cores, load-balanced scheduling for variable-length inputs, and memory-saving techniques like Cascade Attention and Head-Query fusion. The library also supports customizable attention variants and integrates with CUDAGraphs and torch.compile for reduced latency.
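
As a rough illustration of the PyTorch-facing API (closely following the examples in the project README; exact signatures may vary between releases), single-request prefill and decode attention look roughly like this:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 32, 128
kv_len, qo_len = 2048, 128

# Half-precision Q/K/V tensors on a CUDA device; kernels are JIT-compiled on first use.
q = torch.randn(qo_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Prefill (append) attention over the prompt, with a causal mask.
o_prefill = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)

# Decode attention for a single new query token against the cached K/V.
q_decode = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
o_decode = flashinfer.single_decode_with_kv_cache(q_decode, k, v)
```

Batched, paged-KV-cache variants of these operations are exposed through wrapper classes documented at docs.flashinfer.ai.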

Quick Start & Requirements

  • Installation: pip install flashinfer-python (prebuilt wheels are available for Linux with specific CUDA/PyTorch versions, e.g., pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6). Nightly builds and source installation are also supported; a minimal post-install check is sketched after this list.
  • Prerequisites: CUDA Toolkit, PyTorch. NVCC is required for JIT compilation from source.
  • Resources: Requires a CUDA-enabled GPU.
  • Documentation: https://docs.flashinfer.ai/
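
A quick way to confirm the install is to import the package and check that a CUDA device is visible. This is a minimal sketch, not from the project docs; it assumes the package exposes a standard __version__ attribute:

```python
import torch
import flashinfer

# FlashInfer kernels require a CUDA-capable GPU.
assert torch.cuda.is_available(), "FlashInfer requires a CUDA-enabled GPU"

# __version__ is assumed here, as with most pip-installed packages.
print("flashinfer", flashinfer.__version__, "| device:", torch.cuda.get_device_name(0))
```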

Highlighted Details

  • Achieves up to 90% of dense kernel bandwidth for vector-sparse attention.
  • Supports efficient low-precision and fused-RoPE attention for KV-Cache compression.
  • Provides sorting-free GPU kernels for Top-P, Top-K, and Min-P sampling (a sampling sketch follows this list).
  • Exposes PyTorch, TVM, and header-only C++ APIs.
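
For the sorting-free sampling kernels, a sketch of top-p (nucleus) sampling from a batch of probability distributions is shown below. The function lives in the flashinfer.sampling module, but its argument list has changed across releases (older versions also required a pre-drawn uniform tensor), so treat the call as illustrative:

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000

# Per-request next-token probability distributions (rows sum to 1).
logits = torch.randn(batch_size, vocab_size, dtype=torch.float16, device="cuda")
probs = torch.softmax(logits.float(), dim=-1)

# Sorting-free top-p sampling on the GPU; returns one sampled token id per row.
# Signature is illustrative -- check docs.flashinfer.ai for your installed version.
next_tokens = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
print(next_tokens)
```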

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Prebuilt wheels are primarily for Linux; other platforms may require building from source.
  • The license is not clearly stated, which may impact commercial adoption.
Health Check

  • Last Commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 89
  • Issues (30d): 47

Star History

378 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.3%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

Liger-Kernel by linkedin

0.5%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 22 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

0.7%
67k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 9 hours ago