CUDA-L2  by deepreinforce-ai

AI-powered optimization for matrix multiplication

Created 1 month ago
300 stars

Top 88.9% on SourcePulse

Project Summary

CUDA-L2 addresses the performance bottleneck in Half-precision General Matrix Multiply (HGEMM) operations on GPUs. It leverages large language models (LLMs) and reinforcement learning (RL) to automatically generate highly optimized CUDA kernels, aiming to surpass the performance of established libraries like cuBLAS. This project is targeted at researchers and engineers working with deep learning models that heavily rely on matrix multiplication, offering significant speedups for HGEMM computations.

How It Works

The system employs a novel approach combining LLMs and RL to discover optimal HGEMM kernel configurations. It systematically explores the vast design space of CUDA kernel parameters, using RL to guide the search towards configurations that yield superior performance. This data-driven optimization process allows CUDA-L2 to outperform traditional, hand-tuned, or heuristic-based libraries by adapting to specific hardware characteristics and computational demands.
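The explore/exploit search described above can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: the real CUDA-L2 design space, its LLM-driven candidate generation, and its RL policy are far richer, and `measure_tflops` stands in for actually compiling and benchmarking a candidate kernel on the target GPU.

```python
import random

# Hypothetical discrete design space of CUDA kernel parameters
# (block tile shape x pipeline stages); illustrative only.
BLOCK_TILES = [(64, 64), (128, 64), (128, 128), (256, 128)]
STAGES = [2, 3, 4]
CONFIGS = [(bt, s) for bt in BLOCK_TILES for s in STAGES]

def measure_tflops(config):
    """Stand-in reward: in practice this would compile the candidate
    kernel and benchmark it on the target GPU for a given (M, N, K)."""
    (bm, bn), stages = config
    return (bm * bn) / 8192 + stages * 0.1  # toy deterministic surrogate

def search(iters=200, eps=0.3, seed=0):
    """Epsilon-greedy search: mostly exploit the best-known config, but
    keep exploring the space -- the explore/exploit trade-off that an RL
    policy manages in the real system."""
    rng = random.Random(seed)
    best, best_r = None, float("-inf")
    for _ in range(iters):
        if best is None or rng.random() < eps:
            cand = rng.choice(CONFIGS)  # explore a random configuration
        else:
            cand = best                 # exploit the incumbent
        r = measure_tflops(cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r
```

The point of the sketch is only the shape of the loop: propose a candidate, measure its throughput, and bias future proposals toward what measured well on this specific hardware.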

Quick Start & Requirements

  • Installation: Requires cloning the CUTLASS library (git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass).
  • Prerequisites: Python, PyTorch version 2.6.0 or higher.
  • Environment Variables: CUTLASS_DIR must point to the cloned CUTLASS directory, and TORCH_CUDA_ARCH_LIST must be set to the target GPU's compute capability (e.g., "8.0" for A100; RTX 30 series GPUs are "8.6").
  • Usage: The eval_one_file.sh script is used for evaluation in offline or server modes.
  • Links: CUTLASS repository: https://github.com/NVIDIA/cutlass.git (specific tag v4.2.1 required).
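Put together, a typical setup session might look like the following; the install path is an example, and TORCH_CUDA_ARCH_LIST should match your GPU's compute capability.

```shell
# Clone the pinned CUTLASS release the project builds against
git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass

# Required environment variables
export CUTLASS_DIR="$PWD/cutlass"
export TORCH_CUDA_ARCH_LIST="8.0"   # 8.0 = A100; set your GPU's compute capability

# eval_one_file.sh is then run for evaluation (offline or server mode);
# see the repository for its exact arguments.
```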

Highlighted Details

  • Systematically outperforms major matmul baselines including torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning on A100 GPUs.
  • Released A100-optimized HGEMM kernels across 1,000 (M, N, K) configurations.

Maintenance & Community

  • Contact: GitHub issues or jiwei_li@deep-reinforce.com.
  • Roadmap: A "To-Do List" indicates planned extensions to 32-bit accumulators, denser configurations, broader GPU support (Ada Lovelace, Hopper, Blackwell), and easier deployment for open-source LLMs.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Kernels optimized for A100 are recommended for A100; performance on other GPUs is not guaranteed.

Limitations & Caveats

  • Current version only supports 16-bit accumulators for A100 HGEMM; 32-bit accumulator support is planned.
  • Performance is not guaranteed on GPUs other than the one used for kernel training (e.g., A100 kernels for A100).
  • Matrix dimensions not covered by the released (M, N, K) configurations must either be zero-padded up to a supported shape or submitted as a request for new kernels.
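The zero-padding workaround in the last bullet can be sketched as follows. The SUPPORTED shapes and the helper function are illustrative, not part of CUDA-L2's API, and float32 is used so the sketch runs on CPU (the actual kernels operate on FP16 inputs).

```python
import torch

# Hypothetical subset of released (M, N, K) configurations.
SUPPORTED = {(1024, 1024, 1024), (2048, 2048, 2048), (4096, 4096, 4096)}

def pad_to_supported(a: torch.Tensor, b: torch.Tensor):
    """Zero-pad A (M x K) and B (K x N) up to the smallest supported
    (M, N, K); the valid product is the top-left M x N block."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # Pick the smallest supported shape that covers (m, n, k).
    cands = [s for s in SUPPORTED if s[0] >= m and s[1] >= n and s[2] >= k]
    if not cands:
        raise ValueError("no supported configuration covers this shape")
    M, N, K = min(cands)
    a_pad = torch.zeros(M, K, dtype=a.dtype)
    b_pad = torch.zeros(K, N, dtype=b.dtype)
    a_pad[:m, :k] = a
    b_pad[:k, :n] = b
    return a_pad, b_pad, (m, n)

a = torch.randn(1000, 1000, dtype=torch.float32)
b = torch.randn(1000, 1000, dtype=torch.float32)
a_pad, b_pad, (m, n) = pad_to_supported(a, b)
c = (a_pad @ b_pad)[:m, :n]   # matches a @ b on the valid region
```

Padding the K dimension with zeros only adds zero terms to each dot product, so slicing the padded result back to M x N recovers the original product (up to floating-point accumulation order).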

Citation: Su, Songqiao, et al. "CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning." arXiv preprint arXiv:2512.02551 (2025).

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 112 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Zhuohan Li (Coauthor of vLLM).

TileGym by NVIDIA (4.5%, 554 stars)
CUDA Tile kernel library for efficient GPU programming
Created 1 month ago, updated 3 days ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Zhuohan Li (Coauthor of vLLM), and 4 more.

mirage by mirage-project (1.6%, 2k stars)
Tool for fast GPU kernel generation via superoptimization
Created 1 year ago, updated 3 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai (0.4%, 6k stars)
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 11 months ago, updated 5 days ago