KernelBench by ScalingIntelligence

Benchmark for LLMs generating GPU kernels from PyTorch ops

created 9 months ago
499 stars

Top 63.1% on sourcepulse

Project Summary

KernelBench is a benchmark suite designed to evaluate the capability of Large Language Models (LLMs) to generate efficient GPU kernels. It targets researchers and engineers working on AI-driven code generation and optimization for deep learning hardware. The benchmark allows for systematic assessment of LLM-generated CUDA code against PyTorch implementations, measuring both correctness and performance speedup.

How It Works

The benchmark structures the task as transpiling PyTorch operator descriptions into CUDA kernels. It categorizes problems into four levels: single-kernel operators, simple fused patterns, full model architectures, and Hugging Face model architectures. Evaluation involves checking generated kernels for correctness against reference PyTorch operators and measuring performance speedup. A key metric, fast_p, quantifies the fraction of tasks that are both correct and achieve a speedup greater than a specified threshold p.
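The fast_p metric can be sketched in a few lines of Python. The `(correct, speedup)` pair format below is illustrative, not KernelBench's actual result schema:

```python
def fast_p(results, p):
    """Fraction of tasks whose generated kernel is both correct and
    faster than the PyTorch reference by more than a factor of p.

    `results` is a list of (correct: bool, speedup: float) pairs,
    where speedup = reference_time / generated_time.
    """
    if not results:
        return 0.0
    wins = sum(1 for correct, speedup in results if correct and speedup > p)
    return wins / len(results)

# Example: three tasks; only the first is both correct and >1x faster.
results = [(True, 1.5), (True, 0.8), (False, 2.0)]
score = fast_p(results, p=1.0)  # 1 win out of 3 tasks
```

Note that fast_p with p = 0 reduces to plain correctness rate (any correct kernel has speedup > 0), so the threshold lets a single metric interpolate between "merely correct" and "meaningfully faster."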

Quick Start & Requirements

  • Install:
    conda create --name kernel-bench python=3.10
    conda activate kernel-bench
    pip install -r requirements.txt
    pip install -e .
  • Prerequisites: A GPU is required to run and profile kernels, and API keys are needed for the LLM providers used for generation. Modal can be used to run evaluations without a local GPU.
  • Links: Blog Post, arXiv, HuggingFace Dataset

Highlighted Details

  • Four distinct levels of complexity for kernel generation tasks.
  • Evaluation metric fast_p captures both correctness and performance.
  • Scripts provided for generating samples, evaluating kernels, and analyzing results.
  • Baseline timing data available, with recommendations to generate custom baselines.

Maintenance & Community

The project is associated with ScalingIntelligence. Notable users include NVIDIA, METR, and Sakana AI.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for research and development. Commercial use is permitted under the MIT license terms.

Limitations & Caveats

The benchmark requires a GPU for execution. While baseline timing data is provided, generating custom baselines on target hardware is recommended for accurate performance comparisons due to potential variations in hardware, software versions, and cluster power.
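Generating a custom baseline comes down to timing the reference operator on the target hardware. A minimal, framework-agnostic timing sketch (median of warm runs, a common choice; the `time_op` helper is hypothetical, not part of KernelBench's scripts) might look like:

```python
import time

def time_op(op, *args, warmup=3, iters=20):
    """Median wall-clock time of op(*args) over `iters` runs,
    after `warmup` untimed runs. Real GPU kernel timing would
    also need device synchronization around each measurement
    (e.g. torch.cuda.synchronize()), which this sketch omits.
    """
    for _ in range(warmup):
        op(*args)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        op(*args)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]  # median

# Example with a cheap stand-in op.
baseline = time_op(sum, range(10_000))
```

The median (rather than the mean or minimum) reduces sensitivity to scheduler jitter and thermal throttling, which is exactly the hardware variation the caveat above warns about.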

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 9
  • Issues (30d): 2
  • Star history: 220 stars in the last 90 days
