FlashMLA by deepseek-ai

Efficient CUDA kernels for MLA decoding

created 5 months ago · 11,668 stars · Top 4.4% on sourcepulse

Project Summary

FlashMLA provides highly optimized CUDA kernels for Multi-head Latent Attention (MLA) decoding on NVIDIA Hopper GPUs, targeting large language model inference. By combining a paged KV cache with optimized tiling, it delivers significant speedups for compute-bound workloads, benefiting researchers and engineers building high-throughput LLM serving systems.

How It Works

FlashMLA implements MLA decoding kernels optimized for the Hopper architecture, supporting BF16 and FP16 precision. It uses a paged KV cache with a block size of 64 and tiling strategies inspired by FlashAttention and CUTLASS to maximize throughput. The approach pays off in compute-intensive scenarios, where the number of query heads multiplied by the number of query tokens per request is at least 64, allowing the kernels to sustain high TFLOPS.
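
As an illustration of this criterion only (the helper function and the numbers below are hypothetical, not part of the library):

    def is_compute_bound(num_q_heads: int, q_tokens_per_request: int) -> bool:
        # Heuristic from the description above: the optimized kernel pays off when
        # query heads x query tokens per request reaches 64 or more.
        return num_q_heads * q_tokens_per_request >= 64

    # Example: 128 query heads with a single decode token per request gives
    # 128 * 1 = 128 >= 64, so the workload falls in the compute-bound regime.
    print(is_compute_bound(128, 1))  # True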

Quick Start & Requirements

  • Install: python setup.py install
  • Requirements: Hopper GPUs, CUDA 12.3+ (12.8+ recommended), PyTorch 2.0+.
  • Benchmark: python tests/test_flash_mla.py (see the usage sketch after this list)
  • Documentation: deep-dive write-up
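
A minimal decoding-step usage sketch, based on the get_mla_metadata and flash_mla_with_kvcache functions exposed by the flash_mla package; the tensor shapes, head counts, and block counts below are illustrative assumptions, not requirements:

    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    # Hypothetical decode-step sizes (assumptions for illustration only).
    batch, s_q, h_q, h_kv = 4, 1, 128, 1      # one query token per request, single KV head
    d, dv = 576, 512                          # MLA head dims for QK and for V
    block_size, max_blocks = 64, 32           # paged KV cache uses 64-token blocks

    cache_seqlens = torch.randint(
        1, max_blocks * block_size, (batch,), dtype=torch.int32, device="cuda"
    )
    block_table = torch.arange(
        batch * max_blocks, dtype=torch.int32, device="cuda"
    ).view(batch, max_blocks)                  # maps each request to its KV cache blocks
    q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kvcache = torch.randn(
        batch * max_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda"
    )

    # Scheduling metadata is computed once per decoding step and reused across layers.
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv
    )

    # o: (batch, s_q, h_q, dv) attention output; lse: per-head log-sum-exp values.
    o, lse = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )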

Highlighted Details

  • Achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs for compute-bound workloads.
  • Delivers 5%-15% performance improvement on compute-bound workloads.
  • Supports BF16 and FP16 precision, plus a paged KV cache with a block size of 64.
  • Optimized for variable-length sequence serving.

Maintenance & Community

The project is actively maintained by DeepSeek AI and acknowledges inspiration from FlashAttention 2 & 3 and CUTLASS. Community versions are available for MetaX, Moore Threads, Hygon DCU, Intellifusion, Iluvatar Corex, and AMD Instinct GPUs.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The new kernel primarily targets compute-intensive settings; for memory-bound cases, version b31bfe7 is recommended. Compatibility with older CUDA versions or non-Hopper NVIDIA architectures is not specified.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 210 stars in the last 90 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more.

Explore Similar Projects

flash-attention by Dao-AILab

Fast, memory-efficient attention implementation
19k stars · Top 0.7%
created 3 years ago · updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference
54k stars · Top 1.0%
created 2 years ago · updated 14 hours ago