CUDA-L2  by deepreinforce-ai

AI-powered optimization for matrix multiplication

Created 1 month ago
300 stars

Top 88.9% on SourcePulse

Project Summary

CUDA-L2 addresses the performance bottleneck in Half-precision General Matrix Multiply (HGEMM) operations on GPUs. It leverages large language models (LLMs) and reinforcement learning (RL) to automatically generate highly optimized CUDA kernels, aiming to surpass the performance of established libraries like cuBLAS. This project is targeted at researchers and engineers working with deep learning models that heavily rely on matrix multiplication, offering significant speedups for HGEMM computations.

How It Works

The system employs a novel approach combining LLMs and RL to discover optimal HGEMM kernel configurations. It systematically explores the vast design space of CUDA kernel parameters, using RL to guide the search towards configurations that yield superior performance. This data-driven optimization process allows CUDA-L2 to outperform traditional, hand-tuned, or heuristic-based libraries by adapting to specific hardware characteristics and computational demands.
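The explore/exploit search described above can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: the real CUDA-L2 design space, its LLM-driven candidate generation, and its RL policy are far richer, and `measure_tflops` stands in for actually compiling and benchmarking a candidate kernel on the target GPU.

```python
import random

# Hypothetical discrete design space of CUDA kernel parameters
# (block tile shape x pipeline stages); illustrative only.
BLOCK_TILES = [(64, 64), (128, 64), (128, 128), (256, 128)]
STAGES = [2, 3, 4]
CONFIGS = [(bt, s) for bt in BLOCK_TILES for s in STAGES]

def measure_tflops(config):
    """Stand-in reward: in practice this would compile the candidate
    kernel and benchmark it on the target GPU for a given (M, N, K)."""
    (bm, bn), stages = config
    return (bm * bn) / 8192 + stages * 0.1  # toy deterministic surrogate

def search(iters=200, eps=0.3, seed=0):
    """Epsilon-greedy search: mostly exploit the best-known config, but
    keep exploring the space -- the explore/exploit trade-off that an RL
    policy manages in the real system."""
    rng = random.Random(seed)
    best, best_r = None, float("-inf")
    for _ in range(iters):
        if best is None or rng.random() < eps:
            cand = rng.choice(CONFIGS)  # explore a random configuration
        else:
            cand = best                 # exploit the incumbent
        r = measure_tflops(cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r
```

The point of the sketch is only the shape of the loop: propose a candidate, measure its throughput, and bias future proposals toward what measured well on this specific hardware.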

Quick Start & Requirements

  • Installation: Requires cloning the CUTLASS library (git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass).
  • Prerequisites: Python, PyTorch version 2.6.0 or higher.
  • Environment Variables: CUTLASS_DIR must point to the cloned CUTLASS directory, and TORCH_CUDA_ARCH_LIST must be set to the target GPU's compute capability (e.g., "8.0" for A100; RTX 30 series GPUs are "8.6").
  • Usage: The eval_one_file.sh script is used for evaluation in offline or server modes.
  • Links: CUTLASS repository: https://github.com/NVIDIA/cutlass.git (specific tag v4.2.1 required).
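Put together, a typical setup session might look like the following; the install path is an example, and TORCH_CUDA_ARCH_LIST should match your GPU's compute capability.

```shell
# Clone the pinned CUTLASS release the project builds against
git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass

# Required environment variables
export CUTLASS_DIR="$PWD/cutlass"
export TORCH_CUDA_ARCH_LIST="8.0"   # 8.0 = A100; set your GPU's compute capability

# eval_one_file.sh is then run for evaluation (offline or server mode);
# see the repository for its exact arguments.
```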

Highlighted Details

  • Systematically outperforms major matmul baselines including torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning on A100 GPUs.
  • Released A100-optimized HGEMM kernels across 1,000 (M, N, K) configurations.

Maintenance & Community

  • Contact: GitHub issues or jiwei_li@deep-reinforce.com.
  • Roadmap: A "To-Do List" indicates planned extensions to 32-bit accumulators, denser configurations, broader GPU support (Ada Lovelace, Hopper, Blackwell), and easier deployment for open-source LLMs.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Kernels optimized for A100 are recommended for A100; performance on other GPUs is not guaranteed.

Limitations & Caveats

  • Current version only supports 16-bit accumulators for A100 HGEMM; 32-bit accumulator support is planned.
  • Performance is not guaranteed on GPUs other than the one used for kernel training (e.g., A100 kernels for A100).
  • Matrix dimensions not covered by the released (M, N, K) configurations must either be zero-padded up to a supported shape or submitted as a request for new kernels.
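The zero-padding workaround in the last bullet can be sketched as follows. The SUPPORTED shapes and the helper function are illustrative, not part of CUDA-L2's API, and float32 is used so the sketch runs on CPU (the actual kernels operate on FP16 inputs).

```python
import torch

# Hypothetical subset of released (M, N, K) configurations.
SUPPORTED = {(1024, 1024, 1024), (2048, 2048, 2048), (4096, 4096, 4096)}

def pad_to_supported(a: torch.Tensor, b: torch.Tensor):
    """Zero-pad A (M x K) and B (K x N) up to the smallest supported
    (M, N, K); the valid product is the top-left M x N block."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # Pick the smallest supported shape that covers (m, n, k).
    cands = [s for s in SUPPORTED if s[0] >= m and s[1] >= n and s[2] >= k]
    if not cands:
        raise ValueError("no supported configuration covers this shape")
    M, N, K = min(cands)
    a_pad = torch.zeros(M, K, dtype=a.dtype)
    b_pad = torch.zeros(K, N, dtype=b.dtype)
    a_pad[:m, :k] = a
    b_pad[:k, :n] = b
    return a_pad, b_pad, (m, n)

a = torch.randn(1000, 1000, dtype=torch.float32)
b = torch.randn(1000, 1000, dtype=torch.float32)
a_pad, b_pad, (m, n) = pad_to_supported(a, b)
c = (a_pad @ b_pad)[:m, :n]   # matches a @ b on the valid region
```

Padding the K dimension with zeros only adds zero terms to each dot product, so slicing the padded result back to M x N recovers the original product (up to floating-point accumulation order).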

Citation: Su, Songqiao, et al. "CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning." arXiv preprint arXiv:2512.02551 (2025).

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 112 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Zhuohan Li (Coauthor of vLLM).

TileGym by NVIDIA (4.5%, 554 stars)
CUDA Tile kernel library for efficient GPU programming
Created 1 month ago, updated 3 days ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Zhuohan Li (Coauthor of vLLM), and 4 more.

mirage by mirage-project (1.6%, 2k stars)
Tool for fast GPU kernel generation via superoptimization
Created 1 year ago, updated 3 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai (0.4%, 6k stars)
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 11 months ago, updated 5 days ago