Kernel library for LLM serving
Top 14.3% on sourcepulse
FlashInfer is a high-performance kernel library and generator for Large Language Model (LLM) serving, targeting researchers and engineers building efficient inference systems. It provides optimized implementations of key LLM operations like attention and sampling, aiming to deliver state-of-the-art performance and memory efficiency.
How It Works
FlashInfer leverages custom CUDA kernels and JIT compilation to offer highly optimized LLM operations. It provides efficient sparse/dense attention kernels for both CUDA Cores and Tensor Cores, load-balanced scheduling for variable-length inputs, and memory-saving techniques such as Cascade Attention and Head-Query fusion. The library also supports customizable attention variants and integrates with CUDAGraph and torch.compile for reduced latency.
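As a minimal sketch (illustrative shapes; based on FlashInfer's documented single-request prefill API), the snippet below runs causal attention for one request, with grouped-query attention handled implicitly when the query and KV head counts differ:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
# One request: 128 new query tokens attending over a 2048-token KV cache.
q = torch.randn(128, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(2048, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(2048, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Causal prefill attention; FlashInfer dispatches to its optimized kernels
# and applies grouped-query attention since num_qo_heads != num_kv_heads.
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(o.shape)  # torch.Size([128, 32, 128])
```

For batched serving with a paged KV cache, wrapper classes such as BatchDecodeWithPagedKVCacheWrapper plan a load-balanced schedule once per batch layout and reuse it across layers.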
Quick Start & Requirements
Install with pip install flashinfer-python. Prebuilt wheels are available for Linux with specific CUDA/PyTorch versions, e.g., pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6.
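Once installed, a quick smoke test (illustrative shapes; requires a CUDA-capable GPU) is to run a single-query decode attention call:

```python
import torch
import flashinfer

# Decode attention for one query token over a 1024-token KV cache.
num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(1024, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(1024, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # torch.Size([32, 128])
```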
Nightly builds and source installation are also supported.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats