Benchmark for LLMs generating GPU kernels from PyTorch ops
KernelBench is a benchmark suite designed to evaluate the capability of Large Language Models (LLMs) to generate efficient GPU kernels. It targets researchers and engineers working on AI-driven code generation and optimization for deep learning hardware. The benchmark allows for systematic assessment of LLM-generated CUDA code against PyTorch implementations, measuring both correctness and performance speedup.
How It Works
The benchmark structures the task as transpiling PyTorch operator descriptions into CUDA kernels. It categorizes problems into four levels: single-kernel operators, simple fused patterns, full model architectures, and Hugging Face model architectures. Evaluation involves checking generated kernels for correctness against reference PyTorch operators and measuring performance speedup. A key metric, `fast_p`, quantifies the fraction of tasks that are both correct and achieve a speedup greater than a specified threshold `p`.
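To make the metric concrete, here is a minimal sketch of per-task evaluation and `fast_p` aggregation. It is not the repository's actual harness: `evaluate_task`, the result layout, and the tolerances are assumptions for illustration.

```python
import torch

def evaluate_task(ref_op, candidate_op, inputs, n_trials=100):
    # Correctness: the candidate must match the PyTorch reference
    # within tolerance (real harnesses tune tolerances per operator).
    correct = torch.allclose(
        ref_op(*inputs), candidate_op(*inputs), rtol=1e-3, atol=1e-3
    )

    def time_op(op):
        # Wall-clock timing via CUDA events, with warmup.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(10):
            op(*inputs)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_trials):
            op(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_trials  # ms per call

    speedup = time_op(ref_op) / time_op(candidate_op)
    return correct, speedup

def fast_p(results, p=1.0):
    # Fraction of tasks that are both correct and beat the threshold.
    return sum(ok and s > p for ok, s in results) / len(results)
```

Note that `fast_p` with p = 0 reduces to a pure correctness rate, while larger p demands progressively bigger speedups over the PyTorch baseline.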
Quick Start & Requirements
```bash
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
```
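Because evaluation compiles and runs CUDA kernels, a quick environment check after installation can save time. This is a generic sanity check, not a project script:

```python
import torch

# The benchmark requires a CUDA-capable GPU to be visible.
assert torch.cuda.is_available(), "a CUDA GPU is required to run KernelBench"
print(torch.cuda.get_device_name(0), "| CUDA", torch.version.cuda)
```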
Highlighted Details
`fast_p` captures both correctness and performance.
Maintenance & Community
The project is associated with ScalingIntelligence. It has notably been used by NVIDIA, METR, and Sakana AI.
Licensing & Compatibility
Limitations & Caveats
The benchmark requires a GPU for execution. While baseline timing data is provided, generating custom baselines on target hardware is recommended for accurate performance comparisons due to potential variations in hardware, software versions, and cluster power.
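As a concrete starting point, a reference operator can be re-timed on the target machine with `torch.utils.benchmark`. This is a generic sketch (the softmax workload is an arbitrary stand-in, not one of the benchmark's tasks):

```python
import torch
import torch.utils.benchmark as benchmark

# Time a PyTorch reference op on the local GPU so that speedups are
# measured against this hardware rather than published baseline numbers.
x = torch.randn(4096, 4096, device="cuda")
timer = benchmark.Timer(
    stmt="torch.softmax(x, dim=-1)",
    globals={"torch": torch, "x": x},
)
print(timer.timeit(100))  # per-call time averaged over 100 runs
```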