CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
DeepGEMM is a CUDA library for highly efficient FP8 General Matrix Multiplications (GEMMs), targeting researchers and engineers working with NVIDIA Hopper architecture. It provides optimized kernels for both standard and Mixture-of-Experts (MoE) workloads, enabling significant performance gains through fine-grained scaling and advanced hardware features.
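The fine-grained scaling mentioned above can be illustrated with a small sketch: instead of one scale factor per tensor, each small block of values gets its own scale, so an outlier in one block does not destroy precision elsewhere. This is a hedged pure-Python model of the idea, not DeepGEMM's API; the block size, the E4M3 maximum of 448.0, and the function names are illustrative assumptions.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 4             # tiny block size for the demo (real kernels use e.g. 128)

def quantize_per_block(values):
    """Return (scaled_blocks, scales): each block mapped into FP8 range."""
    blocks, scales = [], []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        amax = max(abs(v) for v in block) or 1.0
        scale = amax / FP8_E4M3_MAX          # one scale per block
        blocks.append([v / scale for v in block])  # would be cast to FP8 here
        scales.append(scale)
    return blocks, scales

def dequantize(blocks, scales):
    out = []
    for block, scale in zip(blocks, scales):
        out.extend(v * scale for v in block)
    return out

# Small values and large outliers coexist without a shared scale:
data = [0.001, -0.002, 0.003, 0.004, 100.0, -250.0, 448.0, 7.5]
q, s = quantize_per_block(data)
restored = dequantize(q, s)
```

Because each block is normalized independently, the tiny values in the first block still use the full FP8 dynamic range despite the 448.0 outlier in the second block.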
How It Works
DeepGEMM leverages NVIDIA Hopper's Tensor Memory Accelerator (TMA) for efficient data movement and its FP8 tensor cores for computation. It employs a two-level accumulation strategy using CUDA cores to mitigate FP8 precision issues. The library utilizes a lightweight Just-In-Time (JIT) compilation approach, compiling kernels at runtime to optimize for specific shapes, block sizes, and pipeline stages, similar to Triton. It also incorporates techniques like FFMA SASS interleaving and unaligned block sizes to maximize hardware utilization.
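The two-level accumulation idea can be sketched in pure Python. This is a simplified model, not kernel code: round_to_bits stands in for a limited-precision hardware accumulator, and the 8-bit significand and chunk size of 16 are illustrative assumptions. Accumulating a long sum entirely in low precision stagnates once the running total dwarfs each addend, while accumulating short chunks in low precision and promoting partial sums to full precision stays accurate.

```python
import math

def round_to_bits(x, bits):
    """Round x to `bits` significand bits, mimicking a low-precision register."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def naive_accumulate(terms, bits):
    """Accumulate everything in one low-precision register."""
    acc = 0.0
    for t in terms:
        acc = round_to_bits(acc + t, bits)
    return acc

def two_level_accumulate(terms, bits, chunk):
    """Accumulate short chunks in low precision, promote partials to FP32."""
    total = 0.0                      # high-precision (CUDA-core) accumulator
    for i in range(0, len(terms), chunk):
        partial = 0.0                # low-precision (tensor-core) accumulator
        for t in terms[i:i + chunk]:
            partial = round_to_bits(partial + t, bits)
        total += partial             # promotion happens here
    return total

terms = [1e-3] * 4096
exact = sum(terms)
naive = naive_accumulate(terms, bits=8)
promoted = two_level_accumulate(terms, bits=8, chunk=16)
```

The naive sum freezes as soon as each addend falls below half an ulp of the running total, while the two-level sum stays close to the exact result; the same effect motivates promoting FP8 tensor-core partials on the CUDA cores.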
Quick Start & Requirements
Clone the repository with git clone --recursive so that the required submodules are included, then install with python setup.py install. An NVIDIA Hopper GPU is required (see Limitations & Caveats below).
Highlighted Details
Maintenance & Community
The project is actively developed by deepseek-ai. Roadmap items indicate ongoing work on correctness, performance optimizations, and expanded kernel support.
Licensing & Compatibility
The code is released under the MIT License.
Limitations & Caveats
Currently, DeepGEMM exclusively supports the NVIDIA Hopper architecture (sm_90a). It requires specific TMA alignment for the LHS matrix and only supports the NT format (non-transposed LHS, transposed RHS), so transposition and other FP8 casting operations must be handled separately before invoking the kernels.
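The NT format constraint can be made concrete with a small reference sketch: the LHS is stored row-major as M x K, the RHS is stored transposed as N x K, and both operands are therefore read contiguously along K. This pure-Python model is illustrative only; the function names are assumptions, not DeepGEMM's API.

```python
def gemm_nt(lhs, rhs_t):
    """Reference NT GEMM: D[m][n] = sum_k lhs[m][k] * rhs_t[n][k]."""
    m_dim, k_dim = len(lhs), len(lhs[0])
    n_dim = len(rhs_t)
    assert all(len(row) == k_dim for row in rhs_t), "RHS must be N x K"
    return [[sum(lhs[m][k] * rhs_t[n][k] for k in range(k_dim))
             for n in range(n_dim)]
            for m in range(m_dim)]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

# To compute A @ B with a K x N matrix B, the caller must transpose B
# into N x K form first -- this is the "separate handling" noted above.
A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]                 # K x N, must become N x K before the call
D = gemm_nt(A, transpose(B))
```
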