DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

created 5 months ago
5,568 stars

Top 9.3% on sourcepulse

View on GitHub
Project Summary

DeepGEMM is a CUDA library for highly efficient FP8 General Matrix Multiplications (GEMMs), targeting researchers and engineers working with NVIDIA Hopper architecture. It provides optimized kernels for both standard and Mixture-of-Experts (MoE) workloads, enabling significant performance gains through fine-grained scaling and advanced hardware features.

How It Works

DeepGEMM leverages NVIDIA Hopper's Tensor Memory Accelerator (TMA) for efficient data movement and its FP8 tensor cores for computation. To offset the limited precision of FP8 tensor-core accumulation, it uses a two-level accumulation (promotion) strategy that periodically promotes partial sums to higher precision on CUDA cores. All kernels are compiled at runtime by a lightweight Just-In-Time (JIT) system, similar in spirit to Triton, so each kernel is specialized for its GEMM shape, block size, and number of pipeline stages. The library also applies techniques such as FFMA SASS interleaving and unaligned block sizes to maximize hardware utilization.
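The fine-grained scaling and two-level accumulation are easier to see in a plain PyTorch reference than in the CUDA kernels themselves. The sketch below only illustrates the arithmetic, not DeepGEMM's code; the 1x128 / 128x128 tile sizes and the [n, k] weight layout are assumptions taken from the project description:

```python
import torch

def ref_fp8_gemm_fine_grained(a_fp32: torch.Tensor, b_fp32: torch.Tensor,
                              block: int = 128) -> torch.Tensor:
    """Illustrative reference (assumed tiling, not DeepGEMM's kernels):
    FP8 GEMM with per-block FP32 scales and per-k-block promotion."""
    m, k = a_fp32.shape            # LHS: [m, k]
    n = b_fp32.shape[0]            # RHS stored as [n, k] (NT layout)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max

    # Quantize LHS per 1 x `block` tile: one FP32 scale per (row, k-block).
    a_blocks = a_fp32.view(m, k // block, block)
    a_scales = a_blocks.abs().amax(dim=-1, keepdim=True).clamp_min(1e-4) / fp8_max
    a_fp8 = (a_blocks / a_scales).to(torch.float8_e4m3fn)

    # Quantize RHS per `block` x `block` tile (coarser, weight-style scaling).
    b_blocks = b_fp32.view(n // block, block, k // block, block)
    b_scales = b_blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-4) / fp8_max
    b_fp8 = (b_blocks / b_scales).to(torch.float8_e4m3fn)

    # Two-level accumulation: dequantize each k-block, then accumulate the
    # partial products in FP32, mirroring the CUDA-core promotion step that
    # offsets FP8 tensor-core accumulation error.
    out = torch.zeros(m, n, dtype=torch.float32, device=a_fp32.device)
    for kb in range(k // block):
        a_part = a_fp8[:, kb].to(torch.float32) * a_scales[:, kb]                # [m, block]
        b_part = (b_fp8[:, :, kb].to(torch.float32)
                  * b_scales[:, :, kb]).reshape(n, block)                        # [n, block]
        out += a_part @ b_part.t()
    return out.to(torch.bfloat16)
```

The per-k-block promotion to FP32 in the loop is the host-side analogue of the CUDA-core promotion the real kernels perform between tensor-core MMA groups.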

Quick Start & Requirements

  • Install: python setup.py install
  • Prerequisites: NVIDIA Hopper GPUs (sm_90a), Python 3.8+, CUDA 12.3+ (12.8+ recommended), PyTorch 2.1+, CUTLASS 3.6+.
  • Setup: Clone with --recursive to include submodules.
  • Docs: Official Documentation and Test Code for examples; a hedged usage sketch follows this list.
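For orientation, a call might look like the sketch below. It assumes the gemm_fp8_fp8_bf16_nt entry point and the (FP8 tensor, FP32 scales) tuple convention described in the project's documentation and tests; the two casting helpers are placeholders, not DeepGEMM API:

```python
# Hedged sketch, assuming the gemm_fp8_fp8_bf16_nt entry point and the
# (fp8_tensor, fp32_scales) pair convention from the project docs.
# per_token_cast_to_fp8 / per_block_cast_to_fp8 are placeholders for whatever
# casting utilities you use (the test code ships reference implementations);
# note the LHS scales additionally need DeepGEMM's TMA-aligned layout.
import torch
import deep_gemm

m, k, n = 128, 7168, 2112                                      # multiples of the 128 block size
x = torch.randn(m, k, device='cuda', dtype=torch.bfloat16)     # activations
w = torch.randn(n, k, device='cuda', dtype=torch.bfloat16)     # weights in NT layout: [n, k]

x_fp8, x_scales = per_token_cast_to_fp8(x)    # placeholder: per 1x128 activation scaling
w_fp8, w_scales = per_block_cast_to_fp8(w)    # placeholder: per 128x128 weight scaling

out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)   # pre-allocated BF16 output
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (w_fp8, w_scales), out)
```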

Highlighted Details

  • Achieves up to 1550 TFLOPS on H800.
  • Supports normal, contiguous-grouped, and masked-grouped GEMMs for MoE workloads (a layout sketch follows this list).
  • Fully JIT design for shape-specific optimizations and register savings.
  • Utilizes TMA for asynchronous data movement and multicast.
  • Implements FFMA SASS interleaving for enhanced warp-level parallelism.
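For the MoE variants, the contiguous-grouped layout concatenates each expert's tokens along the M dimension so a single grouped GEMM can serve all experts. The helper below is a hypothetical illustration of that bookkeeping; the 128-row alignment and the function itself are assumptions, not part of DeepGEMM:

```python
import torch

def pack_contiguous_groups(expert_tokens, alignment=128):
    """Hypothetical illustration of a contiguous-grouped MoE layout: each
    expert's token block is concatenated along M and padded up to `alignment`
    rows so every group starts on an aligned boundary (alignment assumed here).
    `expert_tokens` is a list of per-expert activation tensors, each [m_i, k]."""
    k = expert_tokens[0].shape[1]
    padded = [(t.shape[0] + alignment - 1) // alignment * alignment for t in expert_tokens]
    lhs = torch.zeros(sum(padded), k, dtype=expert_tokens[0].dtype,
                      device=expert_tokens[0].device)
    group_idx = torch.empty(sum(padded), dtype=torch.int32, device=lhs.device)
    row = 0
    for expert_id, (tokens, pm) in enumerate(zip(expert_tokens, padded)):
        lhs[row:row + tokens.shape[0]] = tokens   # real tokens; padding rows stay zero
        group_idx[row:row + pm] = expert_id       # every row (incl. padding) maps to its expert
        row += pm
    return lhs, group_idx
```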

Maintenance & Community

The project is actively developed by deepseek-ai. Roadmap items indicate ongoing work on correctness, performance optimizations, and expanded kernel support.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

DeepGEMM currently supports only the NVIDIA Hopper architecture (sm_90a). It imposes TMA alignment requirements on the LHS side and supports only the NT memory layout (non-transposed LHS, transposed RHS), so callers must handle transposition and FP8 casting of their operands themselves.
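The NT constraint mainly affects how weights are stored. The snippet below is a minimal sketch of the implied data preparation; the shapes are arbitrary and the FP8 quantization step is deliberately left abstract:

```python
import torch

# A weight held in the usual [k, n] orientation must be transposed to [n, k] and
# made contiguous before FP8 casting, since the kernels expect a non-transposed
# LHS and a transposed RHS.
w_kn = torch.randn(7168, 4096, device='cuda', dtype=torch.bfloat16)  # stored as [k, n]
w_nk = w_kn.t().contiguous()                                         # NT layout wants [n, k]
# ...then quantize w_nk to FP8 with per-block scales before handing it to the kernel.
```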

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 12
  • Issues (30d): 18
  • Star History: 319 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 7 more.

ThunderKittens by HazyResearch

0.6%
3k
CUDA kernel framework for fast deep learning primitives
created 1 year ago
updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6%
5k
Triton kernels for efficient LLM training
created 1 year ago
updated 1 day ago