FlashMLA by deepseek-ai

Efficient CUDA kernels for MLA decoding

Created 6 months ago
11,722 stars

Top 4.3% on SourcePulse

View on GitHub
Project Summary

FlashMLA provides highly optimized CUDA kernels for efficient Multi-head Latent Attention (MLA) decoding on NVIDIA Hopper GPUs, targeting large language model inference. It delivers significant speedups for compute-bound workloads through techniques such as a paged KV cache and optimized tiling, benefiting researchers and engineers building high-throughput LLM serving systems.

How It Works

FlashMLA implements MLA decoding kernels optimized for the Hopper architecture, supporting BF16 and FP16 precision. It uses a paged KV cache with a block size of 64 and tiling strategies inspired by FlashAttention and CUTLASS to maximize throughput. This approach is most beneficial in compute-intensive scenarios, where the number of query heads multiplied by the tokens per request exceeds 64 and the kernels can sustain high TFLOPS.
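To make that crossover concrete, here is a minimal sketch of the condition; the helper name and the example head counts are hypothetical, and only the threshold of 64 comes from the description above.

    def is_compute_bound(num_query_heads: int, tokens_per_request: int) -> bool:
        # Heuristic from the description above: the new kernel targets the
        # regime where query heads x tokens per request exceeds 64.
        return num_query_heads * tokens_per_request > 64

    # E.g. 128 query heads with 2 speculative tokens per request: 256 > 64,
    # so the compute-bound kernel applies; 16 heads decoding a single token
    # (16 * 1 = 16) stays in the memory-bound regime.
    print(is_compute_bound(128, 2))  # True
    print(is_compute_bound(16, 1))   # False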

Quick Start & Requirements

  • Install: python setup.py install (a usage sketch follows this list)
  • Requirements: Hopper GPUs, CUDA 12.3+ (12.8+ recommended), PyTorch 2.0+.
  • Benchmark: python tests/test_flash_mla.py
  • Documentation: deep-dive write-up
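The kernels are exposed through a small Python API. The sketch below follows the function names in the upstream FlashMLA README (get_mla_metadata and flash_mla_with_kvcache); the tensor shapes and dummy data are illustrative assumptions, so verify the exact signatures against the installed version.

    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    # Illustrative MLA decode shapes (assumed, not prescribed by the project):
    # 2 requests, 1 query token each, 128 query heads, 1 latent KV head,
    # head dim 576 with value dim 512, paged KV cache with block size 64.
    b, s_q, h_q, h_kv, d, dv, block_size = 2, 1, 128, 1, 576, 512, 64

    cache_seqlens = torch.tensor([130, 70], dtype=torch.int32, device="cuda")
    max_blocks = (int(cache_seqlens.max()) + block_size - 1) // block_size
    block_table = torch.arange(b * max_blocks, dtype=torch.int32,
                               device="cuda").view(b, max_blocks)

    q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kv_cache = torch.randn(b * max_blocks, block_size, h_kv, d,
                           dtype=torch.bfloat16, device="cuda")

    # Tile-scheduling metadata is computed once per decoding step ...
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv
    )

    # ... and reused for each layer's attention call.
    out, lse = flash_mla_with_kvcache(
        q, kv_cache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    print(out.shape)  # expected: [b, s_q, h_q, dv]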

Highlighted Details

  • Achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs for compute-bound workloads.
  • The updated kernel delivers a 5%-15% performance improvement over the previous version on compute-bound workloads.
  • Supports BF16 and FP16 precision and a paged KV cache with block size 64 (see the block-table sketch after this list).
  • Optimized for variable-length sequence serving.
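To illustrate the paged layout, here is a hedged sketch of how a block table with block size 64 might be built; build_block_table and the contiguous allocation policy are hypothetical, and only the block size and the notion of a block table come from the project.

    import torch

    BLOCK_SIZE = 64  # paged KV cache block size noted above

    def build_block_table(cache_seqlens: torch.Tensor, max_num_blocks: int) -> torch.Tensor:
        # Hypothetical helper: map each request to the cache blocks holding its
        # KV entries, allocating blocks contiguously and padding unused slots
        # with -1. Real servers typically draw blocks from a free list instead.
        batch = cache_seqlens.shape[0]
        table = torch.full((batch, max_num_blocks), -1, dtype=torch.int32)
        next_free = 0
        for i in range(batch):
            n = (int(cache_seqlens[i]) + BLOCK_SIZE - 1) // BLOCK_SIZE
            table[i, :n] = torch.arange(next_free, next_free + n, dtype=torch.int32)
            next_free += n
        return table

    # Two requests with 100 and 70 cached tokens each occupy 2 blocks of 64:
    # [[0, 1, -1, -1], [2, 3, -1, -1]]
    print(build_block_table(torch.tensor([100, 70]), max_num_blocks=4))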

Maintenance & Community

The project is actively maintained by DeepSeek AI and acknowledges inspiration from FlashAttention 2 & 3 and CUTLASS. Community ports are available for MetaX, Moore Threads, Hygon DCU, Intellifusion, Iluvatar Corex, and AMD Instinct GPUs.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The new kernel primarily targets compute-intensive settings; for memory-bound cases, the earlier kernel at commit b31bfe7 is recommended. Compatibility with older CUDA versions or non-Hopper NVIDIA architectures is not specified.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 6
  • Star History: 47 stars in the last 30 days

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Vincent Weisser (Cofounder of Prime Intellect), and 18 more.

Explore Similar Projects

open-infra-index by deepseek-ai

0.1% · 8k stars
AI infrastructure tools for efficient AGI development
Created 6 months ago · Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8% · 5k stars
Lecture series for GPU-accelerated computing
Created 1 year ago · Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago