flashinfer by flashinfer-ai

Kernel library for LLM serving

Created 2 years ago
4,564 stars

Top 10.7% on SourcePulse

View on GitHub
Project Summary

FlashInfer is a high-performance kernel library and generator for Large Language Model (LLM) serving, targeting researchers and engineers building efficient inference systems. It provides optimized implementations of key LLM operations like attention and sampling, aiming to deliver state-of-the-art performance and memory efficiency.

How It Works

FlashInfer leverages custom CUDA kernels and a JIT compilation approach to offer highly optimized LLM operations. It features efficient sparse/dense attention kernels for both CUDA Cores and Tensor Cores, load-balanced scheduling for variable-length inputs, and memory-saving techniques like Cascade Attention and Head-Query fusion. The library also supports customizable attention variants and integrates with CUDAGraphs and torch.compile for reduced latency.
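
As a rough illustration of the PyTorch-facing API (closely following the examples in the project README; exact signatures may vary between releases), single-request prefill and decode attention look roughly like this:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 32, 128
kv_len, qo_len = 2048, 128

# Half-precision Q/K/V tensors on a CUDA device; kernels are JIT-compiled on first use.
q = torch.randn(qo_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Prefill (append) attention over the prompt, with a causal mask.
o_prefill = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)

# Decode attention for a single new query token against the cached K/V.
q_decode = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
o_decode = flashinfer.single_decode_with_kv_cache(q_decode, k, v)
```

Batched, paged-KV-cache variants of these operations are exposed through wrapper classes documented at docs.flashinfer.ai.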

Quick Start & Requirements

  • Installation: pip install flashinfer-python (prebuilt wheels are available for Linux with specific CUDA/PyTorch versions, e.g., pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6). Nightly builds and source installation are also supported; a minimal post-install check is sketched after this list.
  • Prerequisites: CUDA Toolkit, PyTorch. NVCC is required for JIT compilation from source.
  • Resources: Requires a CUDA-enabled GPU.
  • Documentation: https://docs.flashinfer.ai/
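
A quick way to confirm the install is to import the package and check that a CUDA device is visible. This is a minimal sketch, not from the project docs; it assumes the package exposes a standard __version__ attribute:

```python
import torch
import flashinfer

# FlashInfer kernels require a CUDA-capable GPU.
assert torch.cuda.is_available(), "FlashInfer requires a CUDA-enabled GPU"

# __version__ is assumed here, as with most pip-installed packages.
print("flashinfer", flashinfer.__version__, "| device:", torch.cuda.get_device_name(0))
```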

Highlighted Details

  • Achieves up to 90% of dense kernel bandwidth for vector-sparse attention.
  • Supports efficient low-precision and fused-RoPE attention for KV-Cache compression.
  • Provides sorting-free GPU kernels for Top-P, Top-K, and Min-P sampling (a sampling sketch follows this list).
  • Exposes PyTorch, TVM, and header-only C++ APIs.
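
For the sorting-free sampling kernels, a sketch of top-p (nucleus) sampling from a batch of probability distributions is shown below. The function lives in the flashinfer.sampling module, but its argument list has changed across releases (older versions also required a pre-drawn uniform tensor), so treat the call as illustrative:

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000

# Per-request next-token probability distributions (rows sum to 1).
logits = torch.randn(batch_size, vocab_size, dtype=torch.float16, device="cuda")
probs = torch.softmax(logits.float(), dim=-1)

# Sorting-free top-p sampling on the GPU; returns one sampled token id per row.
# Signature is illustrative -- check docs.flashinfer.ai for your installed version.
next_tokens = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
print(next_tokens)
```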

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Prebuilt wheels are primarily for Linux; other platforms may require building from source.
  • The license is not clearly stated, which may impact commercial adoption.
Health Check

  • Last Commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 89
  • Issues (30d): 47

Star History

378 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.3%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

Liger-Kernel by linkedin

0.5%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 22 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

0.7%
67k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 9 hours ago