deepreinforce-ai / CUDA-L2: AI-powered optimization for matrix multiplication
Top 88.9% on SourcePulse
CUDA-L2 addresses the performance bottleneck in Half-precision General Matrix Multiply (HGEMM) operations on GPUs. It leverages large language models (LLMs) and reinforcement learning (RL) to automatically generate highly optimized CUDA kernels, aiming to surpass the performance of established libraries like cuBLAS. This project is targeted at researchers and engineers working with deep learning models that heavily rely on matrix multiplication, offering significant speedups for HGEMM computations.
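For context, the cuBLAS baseline being targeted is what PyTorch typically dispatches to for half-precision matrix multiplication on CUDA tensors. Below is a minimal sketch of timing that baseline; the shapes and iteration counts are illustrative and not taken from the project.

```python
import torch

# Illustrative HGEMM baseline: half-precision matrix multiply on the GPU.
# torch.matmul on FP16 CUDA tensors typically dispatches to cuBLAS, which
# is the library CUDA-L2 aims to outperform.
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

# Warm-up so cuBLAS heuristics and caches settle before timing.
for _ in range(10):
    torch.matmul(a, b)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end) / iters  # average milliseconds per matmul
tflops = 2 * M * N * K / ms / 1e9     # 2*M*N*K FLOPs per GEMM
print(f"cuBLAS HGEMM baseline: {ms:.3f} ms ({tflops:.1f} TFLOPS)")
```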
How It Works
The system employs a novel approach combining LLMs and RL to discover optimal HGEMM kernel configurations. It systematically explores the vast design space of CUDA kernel parameters, using RL to guide the search towards configurations that yield superior performance. This data-driven optimization process allows CUDA-L2 to outperform traditional, hand-tuned, or heuristic-based libraries by adapting to specific hardware characteristics and computational demands.
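In spirit, the search loop amounts to sampling kernel configurations, benchmarking them on the device, and feeding measured speedup back as reward. The sketch below is a rough illustration only: the configuration space, function names, and random-sampling policy are stand-ins, since the actual system uses an LLM policy trained with RL to propose full CUDA kernels.

```python
import random

# Illustrative kernel-parameter space (tile shapes, pipeline depth).
CONFIG_SPACE = {
    "block_tile": [(64, 64), (128, 64), (128, 128), (256, 128)],
    "warp_tile": [(32, 32), (64, 32), (64, 64)],
    "stages": [2, 3, 4, 5],  # shared-memory pipelining depth
}

def sample_config():
    # Policy stand-in: uniform sampling over a fixed grid. The real system
    # uses an LLM to generate candidate kernels, guided by RL.
    return {k: random.choice(v) for k, v in CONFIG_SPACE.items()}

def benchmark(config):
    # Placeholder reward signal: CUDA-L2 compiles the generated kernel and
    # times it on the target GPU. A dummy score keeps this sketch runnable.
    return random.random()

def search(iterations, baseline_tflops):
    best_config, best_speedup = None, float("-inf")
    for _ in range(iterations):
        config = sample_config()
        # Reward is measured speedup over the cuBLAS baseline; an RL policy
        # would be updated to favor high-reward regions of the space.
        speedup = benchmark(config) / baseline_tflops
        if speedup > best_speedup:
            best_config, best_speedup = config, speedup
    return best_config, best_speedup

best, speedup = search(iterations=1000, baseline_tflops=1.0)
print(best, speedup)
```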
Quick Start & Requirements
- Clone CUTLASS at the required tag: git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass (the specific tag v4.2.1 is required).
- Set CUTLASS_DIR to point to the cloned CUTLASS directory.
- Set TORCH_CUDA_ARCH_LIST for your GPU architecture (e.g., "8.0" for A100/RTX 30 series).
- Use the eval_one_file.sh script for evaluation in offline or server mode (see the sketch below).
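If you drive evaluation from Python, the same requirements can be satisfied in-process. This is a minimal sketch under stated assumptions: the paths are placeholders, and the invocation of eval_one_file.sh is shown without arguments since the script's exact flags for offline vs. server mode are not documented here.

```python
import os
import subprocess

# Placeholders: point CUTLASS_DIR at the checkout made above (tag v4.2.1)
# and set the compute capability for your GPU before building/evaluating.
os.environ["CUTLASS_DIR"] = "/path/to/cutlass"
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # e.g., A100 / RTX 30 series

# eval_one_file.sh comes from the repo; consult the script itself for how
# offline vs. server mode is selected.
subprocess.run(["bash", "eval_one_file.sh"], check=True)
```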
Highlighted Details

Maintenance & Community
Contact: jiwei_li@deep-reinforce.com

Licensing & Compatibility
Limitations & Caveats
Citation: Su, Songqiao, et al. "CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning." arXiv preprint arXiv:2512.02551 (2025).