FlashMLA  by deepseek-ai

Efficient CUDA kernels for MLA decoding

Created 1 year ago
12,665 stars

Top 4.1% on SourcePulse

Project Summary

FlashMLA provides highly optimized CUDA kernels for Multi-head Latent Attention (MLA) decoding on NVIDIA Hopper GPUs, targeting large language model inference. It speeds up compute-bound workloads through a paged KV cache and optimized tiling, which benefits researchers and engineers building high-throughput LLM serving.

How It Works

FlashMLA implements MLA decoding kernels optimized for the Hopper architecture, with support for BF16 and FP16 precision. It uses a paged KV cache with a block size of 64 and tiling strategies inspired by FlashAttention and CUTLASS to maximize throughput. The approach pays off in compute-intensive scenarios, where the number of query heads multiplied by the number of query tokens per request is at least 64; there it reaches up to 660 TFLOPS on H800 SXM5 GPUs.
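To make the paged KV cache idea concrete, here is a minimal sketch of how a logical token position might be translated to a physical cache location when the page size is 64. The function name `locate_token` and the example block table are hypothetical and are not taken from the FlashMLA code base; the real kernels perform this lookup on the GPU.

```python
BLOCK_SIZE = 64  # page size stated in the summary


def locate_token(block_table, token_idx, block_size=BLOCK_SIZE):
    """Map a logical token position to (physical_block, offset_in_block)."""
    logical_block, offset = divmod(token_idx, block_size)
    return block_table[logical_block], offset


# Hypothetical block table for one sequence:
# logical block i is stored in physical block block_table[i].
block_table = [7, 2, 9]

print(locate_token(block_table, 0))    # (7, 0)
print(locate_token(block_table, 64))   # (2, 0)
print(locate_token(block_table, 130))  # (9, 2)
```

Because each sequence only needs a table of block indices, its cache does not have to be contiguous in memory, which is what lets a server grow and release cache space one 64-token block at a time.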

Quick Start & Requirements

  • Install: python setup.py install
  • Requirements: Hopper GPUs, CUDA 12.3+ (12.8+ recommended), PyTorch 2.0+.
  • Benchmark: python tests/test_flash_mla.py
  • Documentation: deep-dive write-up
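Before installing, it can help to check a machine against the listed requirements. The sketch below is one possible preflight check in plain Python; the helper names are hypothetical and not part of FlashMLA. It assumes Hopper corresponds to CUDA compute capability 9.0, and that in practice the inputs would come from `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `torch.__version__`.

```python
import re

HOPPER_CAPABILITY = (9, 0)   # assumption: Hopper GPUs report compute capability 9.0
MIN_CUDA = (12, 3)
RECOMMENDED_CUDA = (12, 8)
MIN_TORCH = (2, 0)


def parse_version(text):
    """Turn '12.8' or '2.3.1+cu121' into a (major, minor) tuple."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:2])


def check_requirements(capability, cuda_version, torch_version):
    """Return (errors, warnings) for the requirements listed above."""
    errors, warnings = [], []
    if tuple(capability) != HOPPER_CAPABILITY:
        errors.append("GPU is not Hopper (expected compute capability 9.0)")
    cuda = parse_version(cuda_version)
    if cuda < MIN_CUDA:
        errors.append("CUDA 12.3 or newer is required")
    elif cuda < RECOMMENDED_CUDA:
        warnings.append("CUDA 12.8 or newer is recommended")
    if parse_version(torch_version) < MIN_TORCH:
        errors.append("PyTorch 2.0 or newer is required")
    return errors, warnings


# Hypothetical machine: Hopper GPU, CUDA 12.4, PyTorch 2.3.1
print(check_requirements((9, 0), "12.4", "2.3.1+cu121"))
# ([], ['CUDA 12.8 or newer is recommended'])
```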

Highlighted Details

  • Achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs for compute-bound workloads.
  • The updated kernel delivers a 5%-15% performance improvement over the previous release on compute-bound workloads.
  • Supports BF16 and FP16 precision and a paged KV cache with a block size of 64.
  • Optimized for variable-length sequence serving.
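The link between paging and variable-length serving can be shown with a little arithmetic. The sketch below is illustrative only: `build_block_tables` is a hypothetical helper, the sequence lengths are made up, and a real allocator would hand out blocks from a free list rather than numbering them consecutively. It compares the cache slots reserved by 64-token pages against padding every request to the longest one.

```python
BLOCK_SIZE = 64  # KV cache page size from the list above


def build_block_tables(seq_lens, block_size=BLOCK_SIZE):
    """Assign each sequence just enough physical blocks, numbered consecutively."""
    tables, next_free = [], 0
    for n in seq_lens:
        count = -(-n // block_size)  # ceiling division
        tables.append(list(range(next_free, next_free + count)))
        next_free += count
    return tables


seq_lens = [10, 64, 200]  # hypothetical cached lengths for three requests
tables = build_block_tables(seq_lens)

paged_slots = sum(len(t) for t in tables) * BLOCK_SIZE
dense_slots = max(seq_lens) * len(seq_lens)  # every request padded to the longest

print(tables)       # [[0], [1], [2, 3, 4, 5]]
print(paged_slots)  # 384
print(dense_slots)  # 600
```

With paging, the waste per request is bounded by one partially filled block, regardless of how much the lengths in a batch differ.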

Maintenance & Community

The project is actively maintained by DeepSeek AI. It acknowledges inspiration from FlashAttention 2&3 and CUTLASS. Community versions are available for MetaX, Moore Threads, Hygon DCU, Intellifusion, Iluvatar Corex, and AMD Instinct GPUs.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The new kernel primarily targets compute-intensive settings; for memory-bound cases, the earlier version at commit b31bfe7 is recommended. Compatibility with older CUDA versions or non-Hopper NVIDIA architectures is not specified.
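As a rough illustration of that guidance, the sketch below classifies a workload by the product of query heads and query tokens per request and returns which build the summary points to. The function names are hypothetical and not part of FlashMLA. The sketch treats a product of exactly 64 as compute-intensive; adjust the comparison if your reading of the threshold differs.

```python
COMPUTE_THRESHOLD = 64  # query heads x query tokens per request


def is_compute_intensive(num_q_heads, q_tokens_per_request):
    """Assumption: a product of exactly 64 already counts as compute-intensive."""
    return num_q_heads * q_tokens_per_request >= COMPUTE_THRESHOLD


def suggested_version(num_q_heads, q_tokens_per_request):
    """Return 'latest' for compute-intensive settings, else the commit named above."""
    if is_compute_intensive(num_q_heads, q_tokens_per_request):
        return "latest"
    return "b31bfe7"


# Hypothetical configurations
print(suggested_version(128, 1))  # latest
print(suggested_version(16, 1))   # b31bfe7
print(suggested_version(16, 4))   # latest
```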

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 4
  • Issues (30d): 0
  • Star history: 87 stars in the last 30 days
