DeepGEMM by deepseek-ai

CUDA library for efficient FP8 GEMM kernels with fine-grained scaling

Created 11 months ago
6,051 stars

Top 8.4% on SourcePulse

View on GitHub
Project Summary

DeepGEMM is a CUDA library for highly efficient FP8 General Matrix Multiplications (GEMMs), targeting researchers and engineers working with NVIDIA Hopper architecture. It provides optimized kernels for both standard and Mixture-of-Experts (MoE) workloads, enabling significant performance gains through fine-grained scaling and advanced hardware features.

How It Works

DeepGEMM leverages NVIDIA Hopper's Tensor Memory Accelerator (TMA) for efficient data movement and its FP8 tensor cores for computation. It employs a two-level accumulation strategy using CUDA cores to mitigate FP8 precision issues. The library utilizes a lightweight Just-In-Time (JIT) compilation approach, compiling kernels at runtime to optimize for specific shapes, block sizes, and pipeline stages, similar to Triton. It also incorporates techniques like FFMA SASS interleaving and unaligned block sizes to maximize hardware utilization.
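To make the fine-grained scaling and two-level accumulation concrete, here is a pure-PyTorch sketch (not DeepGEMM's kernel; the 1x128 block size and e4m3 format are assumptions for illustration). Each 128-element block gets its own scale, and each block's partial product is accumulated in FP32, mimicking the promotion the Hopper kernels perform on CUDA cores:

    import torch

    BLOCK = 128  # assumed fine-grained scaling granularity

    def quantize_fp8_per_block(x):
        # One scale per 1x128 block; 448 is the e4m3 max normal value.
        m, k = x.shape
        xb = x.view(m, k // BLOCK, BLOCK)
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / 448.0
        q = (xb / scale).to(torch.float8_e4m3fn)
        return q.view(m, k), scale.squeeze(-1)

    def gemm_fp8_reference(a_q, a_s, b_q, b_s):
        # Dequantize block by block and accumulate partial sums in FP32.
        m, k = a_q.shape
        n = b_q.shape[0]
        out = torch.zeros(m, n, dtype=torch.float32, device=a_q.device)
        for blk in range(k // BLOCK):
            sl = slice(blk * BLOCK, (blk + 1) * BLOCK)
            a = a_q[:, sl].to(torch.float32) * a_s[:, blk:blk + 1]
            b = b_q[:, sl].to(torch.float32) * b_s[:, blk:blk + 1]
            out += a @ b.t()
        return out.to(torch.bfloat16)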

Quick Start & Requirements

  • Install: python setup.py install
  • Prerequisites: NVIDIA Hopper GPUs (sm_90a), Python 3.8+, CUDA 12.3+ (12.8+ recommended), PyTorch 2.1+, CUTLASS 3.6+.
  • Setup: Clone with --recursive to include submodules.
  • Docs: Official Documentation and Test Code for examples; a hedged usage sketch follows this list.
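A minimal usage sketch is shown below. The entry-point name, argument tuples, and scale-tensor shapes are assumptions based on the NT FP8 GEMM described above; consult the test code for the exact API:

    import torch
    import deep_gemm

    m, k, n = 4096, 7168, 2048
    # FP8 operands are passed together with their fine-grained scales;
    # the output is BF16. The scale shapes here are illustrative.
    lhs = (torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn),
           torch.ones(m, k // 128, device='cuda'))
    rhs = (torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn),
           torch.ones(n // 128, k // 128, device='cuda'))
    out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)
    deep_gemm.gemm_fp8_fp8_bf16_nt(lhs, rhs, out)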

Highlighted Details

  • Achieves up to 1550 FP8 TFLOPS on H800.
  • Supports normal, contiguous-grouped, and masked-grouped GEMMs for MoE (see the sketch after this list).
  • Fully JIT design for shape-specific optimizations and register savings.
  • Utilizes TMA for asynchronous data movement and multicast.
  • Implements FFMA SASS interleaving for enhanced warp-level parallelism.

Maintenance & Community

The project is actively developed by deepseek-ai. Roadmap items indicate ongoing work on correctness, performance optimizations, and expanded kernel support.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently, DeepGEMM exclusively supports the NVIDIA Hopper architecture (sm_90a). It requires specific TMA alignment for the LHS matrix and supports only the NT format (non-transposed LHS, transposed RHS), so transpositions and other FP8 casting operations must be handled separately.
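For example, a weight stored as [K, N] has to be transposed (and cast to FP8) before being passed as the RHS; this preparation step is plain PyTorch and is not handled by DeepGEMM:

    import torch

    k, n = 7168, 4096
    w_kn = torch.randn(k, n, device='cuda')   # weight stored as [K, N]
    rhs = w_kn.t().contiguous()               # [N, K], as the NT format expects
    rhs_fp8 = rhs.to(torch.float8_e4m3fn)     # FP8 cast (scaling factors omitted)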

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 33
  • Issues (30d): 3
  • Star History: 107 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Vincent Weisser (Cofounder of Prime Intellect), and 17 more.

ThunderKittens by HazyResearch

Top 0.5% on SourcePulse · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago · Updated 11 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

Top 0.8% on SourcePulse · 6k stars
Lecture series for GPU-accelerated computing
Created 2 years ago · Updated 1 month ago
Starred by Tri Dao (Chief Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 23 more.

cutlass by NVIDIA

Top 0.5% on SourcePulse · 9k stars
CUDA C++ and Python DSLs for high-performance linear algebra
Created 8 years ago · Updated 2 days ago