CUDA-Agent by BytedTsinghua-SIA

Agentic RL for high-performance CUDA kernel generation

Created 2 months ago
938 stars

Top 38.8% on SourcePulse

Project Summary

CUDA-Agent tackles high-performance CUDA kernel generation with large-scale agentic reinforcement learning (RL). It targets researchers and engineers who optimize GPU computations. According to the README, it reaches state-of-the-art results on KernelBench, outperforming both general-purpose LLMs and the torch.compile baseline, with the largest gains on the harder kernel generation tasks.

How It Works

The project uses an RL-trained model to generate CUDA kernels and evaluates it on the KernelBench benchmark. Its core component is an agentic workspace (agent_workdir) that runs a full development loop: generate a kernel, compile it, verify correctness, profile performance, and iterate on the feedback. This structured loop lets the agent pursue targeted optimizations beyond what standard compilation provides.
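The development loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `Attempt`, `optimize_kernel`, and the four callables (`generate`, `compile_kernel`, `verify`, `profile`) are hypothetical names introduced here, and the feedback strings are placeholders for whatever the real agent passes back to the model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    """One candidate kernel and what the loop learned about it (hypothetical type)."""
    source: str
    compiled: bool = False
    correct: bool = False
    runtime_ms: Optional[float] = None


def optimize_kernel(
    generate: Callable[[str], str],                      # feedback -> kernel source
    compile_kernel: Callable[[str], Tuple[bool, str]],   # source -> (ok, compiler log)
    verify: Callable[[str], bool],                       # source -> outputs match reference?
    profile: Callable[[str], float],                     # source -> runtime in milliseconds
    max_iters: int = 5,
) -> Optional[Attempt]:
    """Generate, compile, verify, profile, and iterate; return the fastest correct attempt."""
    best: Optional[Attempt] = None
    feedback = ""  # the first generation call gets no feedback
    for _ in range(max_iters):
        attempt = Attempt(source=generate(feedback))

        ok, log = compile_kernel(attempt.source)
        attempt.compiled = ok
        if not ok:
            feedback = f"compile error: {log}"
            continue

        attempt.correct = verify(attempt.source)
        if not attempt.correct:
            feedback = "output mismatch against reference"
            continue

        # Only kernels that compile and pass verification are profiled.
        attempt.runtime_ms = profile(attempt.source)
        if best is None or attempt.runtime_ms < best.runtime_ms:
            best = attempt
        feedback = f"correct, {attempt.runtime_ms:.3f} ms; try to go faster"
    return best
```

Passing the stages in as callables keeps the sketch independent of any particular model or toolchain. Profiling only after a successful correctness check means a fast but wrong kernel is never recorded as the best attempt.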

Quick Start & Requirements

The repository ships agent_workdir with scripts for compilation (utils/compile.sh), correctness verification (utils/verification.py), and performance profiling (utils/profiling.py). The 6,000-sample training dataset, CUDA-Agent-Ops-6K, is also released, and the README links to it. Setup likely requires a CUDA-enabled environment and Python 3; the README does not state specific hardware requirements (e.g., GPU model, VRAM) or detailed installation steps.
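As a rough illustration of how the three scripts might be chained, the sketch below runs them in order and stops at the first failure. Only the script paths come from the summary above. The assumptions are that each script runs without arguments from the agent_workdir root and signals failure through a non-zero exit code; neither is confirmed by the README, so check the scripts before relying on this. `STAGES` and `run_stages` are hypothetical names.

```python
import subprocess
from typing import Callable, List, Tuple

# Script paths are taken from the project summary; the argument-free
# invocation is an ASSUMPTION, not documented behavior.
STAGES = [
    ("compile", ["bash", "utils/compile.sh"]),
    ("verify", ["python3", "utils/verification.py"]),
    ("profile", ["python3", "utils/profiling.py"]),
]


def run_stages(workdir: str, runner: Callable = subprocess.run) -> List[Tuple[str, int]]:
    """Run compile, verify, and profile in order; stop at the first non-zero exit code."""
    results: List[Tuple[str, int]] = []
    for name, cmd in STAGES:
        proc = runner(cmd, cwd=str(workdir))
        results.append((name, proc.returncode))
        if proc.returncode != 0:
            break  # later stages are meaningless if an earlier one failed
    return results
```

A call such as `run_stages("agent_workdir")` returns a list like `[("compile", 0), ("verify", 1)]` when verification fails. The `runner` parameter exists so the chain can be exercised with a stub on a machine without a GPU.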

Highlighted Details

  • Achieves state-of-the-art performance on KernelBench, surpassing advanced LLMs like Claude Opus-4.6 and Gemini 3 Pro.
  • Consistently outperforms the torch.compile baseline, especially on challenging kernel generation tasks.
  • Releases the CUDA-Agent-Ops-6K training dataset, including its construction pipeline and filtering criteria.
  • Provides an agent environment with workflow constraints (SKILL.md) and a full development loop implementation.
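To make the comparison against a baseline such as torch.compile concrete, here is a small sketch of a fast_p-style score: the fraction of tasks whose generated kernel is both correct and more than p times faster than the baseline. KernelBench uses a metric of this shape, but whether the README reports exactly this quantity is an assumption. `TaskResult`, `speedup`, and `fast_p` are illustrative names, and the timings in the usage note are made up for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Outcome for one benchmark task (hypothetical record type)."""
    correct: bool       # generated kernel matched the reference output
    baseline_ms: float  # runtime of the baseline, e.g. torch.compile
    kernel_ms: float    # runtime of the generated kernel


def speedup(result: TaskResult) -> float:
    """Baseline time divided by kernel time; values above 1.0 mean the kernel is faster."""
    return result.baseline_ms / result.kernel_ms


def fast_p(results: Sequence[TaskResult], p: float) -> float:
    """Fraction of all tasks that are correct AND have speedup strictly greater than p."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and speedup(r) > p)
    return hits / len(results)
```

For example, with four tasks where two are correct and faster than the baseline, one is correct but slower, and one is fast but incorrect, `fast_p(results, 1.0)` is 0.5. Incorrect kernels count against the score regardless of speed, which is why the metric divides by the total number of tasks rather than only the correct ones.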

Maintenance & Community

The README does not list maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The README does not specify a license. Without one, compatibility with commercial use or integration into closed-source projects cannot be assessed.

Limitations & Caveats

Agent trace results and a web demo are marked as forthcoming ("Please stay tuned"). The README concentrates on generation capabilities and benchmark results, and gives little detail on the underlying RL training infrastructure or the full setup requirements. As noted under Licensing & Compatibility, no license is stated.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 66 stars in the last 30 days
