DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

Created 11 months ago
6,051 stars

Top 8.4% on SourcePulse

View on GitHub
Project Summary

DeepGEMM is a CUDA library for highly efficient FP8 General Matrix Multiplications (GEMMs), targeting researchers and engineers working with NVIDIA Hopper architecture. It provides optimized kernels for both standard and Mixture-of-Experts (MoE) workloads, enabling significant performance gains through fine-grained scaling and advanced hardware features.

How It Works

DeepGEMM leverages NVIDIA Hopper's Tensor Memory Accelerator (TMA) for efficient data movement and its FP8 tensor cores for computation. It employs a two-level accumulation strategy using CUDA cores to mitigate FP8 precision issues. The library utilizes a lightweight Just-In-Time (JIT) compilation approach, compiling kernels at runtime to optimize for specific shapes, block sizes, and pipeline stages, similar to Triton. It also incorporates techniques like FFMA SASS interleaving and unaligned block sizes to maximize hardware utilization.
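To make the fine-grained scaling and two-level accumulation concrete, here is a pure-PyTorch sketch (not DeepGEMM's kernel; the 1x128 block size and e4m3 format are assumptions for illustration). Each 128-element block gets its own scale, and each block's partial product is accumulated in FP32, mimicking the promotion the Hopper kernels perform on CUDA cores:

    import torch

    BLOCK = 128  # assumed fine-grained scaling granularity

    def quantize_fp8_per_block(x):
        # One scale per 1x128 block; 448 is the e4m3 max normal value.
        m, k = x.shape
        xb = x.view(m, k // BLOCK, BLOCK)
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / 448.0
        q = (xb / scale).to(torch.float8_e4m3fn)
        return q.view(m, k), scale.squeeze(-1)

    def gemm_fp8_reference(a_q, a_s, b_q, b_s):
        # Dequantize block by block and accumulate partial sums in FP32.
        m, k = a_q.shape
        n = b_q.shape[0]
        out = torch.zeros(m, n, dtype=torch.float32, device=a_q.device)
        for blk in range(k // BLOCK):
            sl = slice(blk * BLOCK, (blk + 1) * BLOCK)
            a = a_q[:, sl].to(torch.float32) * a_s[:, blk:blk + 1]
            b = b_q[:, sl].to(torch.float32) * b_s[:, blk:blk + 1]
            out += a @ b.t()
        return out.to(torch.bfloat16)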

Quick Start & Requirements

  • Install: python setup.py install
  • Prerequisites: NVIDIA Hopper GPUs (sm_90a), Python 3.8+, CUDA 12.3+ (12.8+ recommended), PyTorch 2.1+, CUTLASS 3.6+.
  • Setup: Clone with --recursive to include submodules.
  • Docs: Official Documentation and Test Code for examples; a hedged usage sketch follows this list.
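A minimal usage sketch is shown below. The entry-point name, argument tuples, and scale-tensor shapes are assumptions based on the NT FP8 GEMM described above; consult the test code for the exact API:

    import torch
    import deep_gemm

    m, k, n = 4096, 7168, 2048
    # FP8 operands are passed together with their fine-grained scales;
    # the output is BF16. The scale shapes here are illustrative.
    lhs = (torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn),
           torch.ones(m, k // 128, device='cuda'))
    rhs = (torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn),
           torch.ones(n // 128, k // 128, device='cuda'))
    out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)
    deep_gemm.gemm_fp8_fp8_bf16_nt(lhs, rhs, out)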

Highlighted Details

  • Achieves up to 1550 FP8 TFLOPS on H800.
  • Supports normal, contiguous-grouped, and masked-grouped GEMMs for MoE (see the sketch after this list).
  • Fully JIT design for shape-specific optimizations and register savings.
  • Utilizes TMA for asynchronous data movement and multicast.
  • Implements FFMA SASS interleaving for enhanced warp-level parallelism.

Maintenance & Community

The project is actively developed by deepseek-ai. Roadmap items indicate ongoing work on correctness, performance optimizations, and expanded kernel support.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently, DeepGEMM exclusively supports the NVIDIA Hopper architecture (sm_90a). It requires specific TMA alignment for the LHS matrix and supports only the NT format (non-transposed LHS, transposed RHS), so transpositions and other FP8 casting operations must be handled separately.
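For example, a weight stored as [K, N] has to be transposed (and cast to FP8) before being passed as the RHS; this preparation step is plain PyTorch and is not handled by DeepGEMM:

    import torch

    k, n = 7168, 4096
    w_kn = torch.randn(k, n, device='cuda')   # weight stored as [K, N]
    rhs = w_kn.t().contiguous()               # [N, K], as the NT format expects
    rhs_fp8 = rhs.to(torch.float8_e4m3fn)     # FP8 cast (scaling factors omitted)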

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 33
  • Issues (30d): 3
  • Star History: 107 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Vincent Weisser (Cofounder of Prime Intellect), and 17 more.

ThunderKittens by HazyResearch

Top 0.5% on SourcePulse · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago · Updated 11 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

Top 0.8% on SourcePulse · 6k stars
Lecture series for GPU-accelerated computing
Created 2 years ago · Updated 1 month ago
Starred by Tri Dao (Chief Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 23 more.

cutlass by NVIDIA

Top 0.5% on SourcePulse · 9k stars
CUDA C++ and Python DSLs for high-performance linear algebra
Created 8 years ago · Updated 2 days ago