FlashMLA by deepseek-ai

Efficient CUDA kernels for MLA decoding

created 5 months ago · 11,668 stars · Top 4.4% on sourcepulse

Project Summary

FlashMLA provides highly optimized CUDA kernels for Multi-head Latent Attention (MLA) decoding on NVIDIA Hopper GPUs, targeting large language model inference. By combining a paged KV cache with optimized tiling, it delivers significant speedups for compute-bound workloads, benefiting researchers and engineers building high-throughput LLM serving systems.

How It Works

FlashMLA implements MLA decoding kernels optimized for the Hopper architecture, supporting BF16 and FP16 precision. It uses a paged KV cache with a block size of 64 and tiling strategies inspired by FlashAttention and CUTLASS to maximize throughput. The approach pays off in compute-intensive scenarios, where the number of query heads multiplied by the number of query tokens per request is at least 64, allowing the kernels to sustain high TFLOPS.
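
As an illustration of this criterion only (the helper function and the numbers below are hypothetical, not part of the library):

    def is_compute_bound(num_q_heads: int, q_tokens_per_request: int) -> bool:
        # Heuristic from the description above: the optimized kernel pays off when
        # query heads x query tokens per request reaches 64 or more.
        return num_q_heads * q_tokens_per_request >= 64

    # Example: 128 query heads with a single decode token per request gives
    # 128 * 1 = 128 >= 64, so the workload falls in the compute-bound regime.
    print(is_compute_bound(128, 1))  # True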

Quick Start & Requirements

  • Install: python setup.py install
  • Requirements: Hopper GPUs, CUDA 12.3+ (12.8+ recommended), PyTorch 2.0+.
  • Benchmark: python tests/test_flash_mla.py (see the usage sketch after this list)
  • Documentation: deep-dive write-up
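
A minimal decoding-step usage sketch, based on the get_mla_metadata and flash_mla_with_kvcache functions exposed by the flash_mla package; the tensor shapes, head counts, and block counts below are illustrative assumptions, not requirements:

    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    # Hypothetical decode-step sizes (assumptions for illustration only).
    batch, s_q, h_q, h_kv = 4, 1, 128, 1      # one query token per request, single KV head
    d, dv = 576, 512                          # MLA head dims for QK and for V
    block_size, max_blocks = 64, 32           # paged KV cache uses 64-token blocks

    cache_seqlens = torch.randint(
        1, max_blocks * block_size, (batch,), dtype=torch.int32, device="cuda"
    )
    block_table = torch.arange(
        batch * max_blocks, dtype=torch.int32, device="cuda"
    ).view(batch, max_blocks)                  # maps each request to its KV cache blocks
    q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kvcache = torch.randn(
        batch * max_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda"
    )

    # Scheduling metadata is computed once per decoding step and reused across layers.
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv
    )

    # o: (batch, s_q, h_q, dv) attention output; lse: per-head log-sum-exp values.
    o, lse = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )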

Highlighted Details

  • Achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs for compute-bound workloads.
  • Delivers 5%-15% performance improvement on compute-bound workloads.
  • Supports BF16 and FP16 precision, plus a paged KV cache with a block size of 64.
  • Optimized for variable-length sequence serving.

Maintenance & Community

The project is actively maintained by DeepSeek AI and acknowledges inspiration from FlashAttention 2 & 3 and CUTLASS. Community versions are available for MetaX, Moore Threads, Hygon DCU, Intellifusion, Iluvatar Corex, and AMD Instinct GPUs.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The new kernel primarily targets compute-intensive settings; for memory-bound cases, version b31bfe7 is recommended. Compatibility with older CUDA versions or non-Hopper NVIDIA architectures is not specified.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 210 stars in the last 90 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more.

Explore Similar Projects

flash-attention by Dao-AILab

Fast, memory-efficient attention implementation
19k stars · Top 0.7%
created 3 years ago · updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference
54k stars · Top 1.0%
created 2 years ago · updated 14 hours ago