KernelBench by ScalingIntelligence

Benchmark for LLMs generating GPU kernels from PyTorch ops

created 9 months ago
499 stars

Top 63.1% on sourcepulse

Project Summary

KernelBench is a benchmark suite designed to evaluate the capability of Large Language Models (LLMs) to generate efficient GPU kernels. It targets researchers and engineers working on AI-driven code generation and optimization for deep learning hardware. The benchmark allows for systematic assessment of LLM-generated CUDA code against PyTorch implementations, measuring both correctness and performance speedup.

How It Works

The benchmark structures the task as transpiling PyTorch operator descriptions into CUDA kernels. It categorizes problems into four levels: single-kernel operators, simple fused patterns, full model architectures, and Hugging Face model architectures. Evaluation involves checking generated kernels for correctness against reference PyTorch operators and measuring performance speedup. A key metric, fast_p, quantifies the fraction of tasks that are both correct and achieve a speedup greater than a specified threshold p.
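The fast_p metric can be sketched in a few lines of Python. The `(correct, speedup)` pair format below is illustrative, not KernelBench's actual result schema:

```python
def fast_p(results, p):
    """Fraction of tasks whose generated kernel is both correct and
    faster than the PyTorch reference by more than a factor of p.

    `results` is a list of (correct: bool, speedup: float) pairs,
    where speedup = reference_time / generated_time.
    """
    if not results:
        return 0.0
    wins = sum(1 for correct, speedup in results if correct and speedup > p)
    return wins / len(results)

# Example: three tasks; only the first is both correct and >1x faster.
results = [(True, 1.5), (True, 0.8), (False, 2.0)]
score = fast_p(results, p=1.0)  # 1 win out of 3 tasks
```

Note that fast_p with p = 0 reduces to plain correctness rate (any correct kernel has speedup > 0), so the threshold lets a single metric interpolate between "merely correct" and "meaningfully faster."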

Quick Start & Requirements

  • Install:
    conda create --name kernel-bench python=3.10
    conda activate kernel-bench
    pip install -r requirements.txt
    pip install -e .
  • Prerequisites: A GPU is required to run and profile kernels, and API keys are needed for the LLM providers used for generation. Modal can be used to run evaluations without a local GPU.
  • Links: Blog Post, arXiv, HuggingFace Dataset

Highlighted Details

  • Four distinct levels of complexity for kernel generation tasks.
  • Evaluation metric fast_p captures both correctness and performance.
  • Scripts provided for generating samples, evaluating kernels, and analyzing results.
  • Baseline timing data available, with recommendations to generate custom baselines.

Maintenance & Community

The project is associated with ScalingIntelligence. Notable users include NVIDIA, METR, and Sakana AI.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Suitable for research and development. Commercial use is permitted under the MIT license terms.

Limitations & Caveats

The benchmark requires a GPU for execution. While baseline timing data is provided, generating custom baselines on target hardware is recommended for accurate performance comparisons due to potential variations in hardware, software versions, and cluster power.
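Generating a custom baseline comes down to timing the reference operator on the target hardware. A minimal, framework-agnostic timing sketch (median of warm runs, a common choice; the `time_op` helper is hypothetical, not part of KernelBench's scripts) might look like:

```python
import time

def time_op(op, *args, warmup=3, iters=20):
    """Median wall-clock time of op(*args) over `iters` runs,
    after `warmup` untimed runs. Real GPU kernel timing would
    also need device synchronization around each measurement
    (e.g. torch.cuda.synchronize()), which this sketch omits.
    """
    for _ in range(warmup):
        op(*args)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        op(*args)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]  # median

# Example with a cheap stand-in op.
baseline = time_op(sum, range(10_000))
```

The median (rather than the mean or minimum) reduces sensitivity to scheduler jitter and thermal throttling, which is exactly the hardware variation the caveat above warns about.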

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 9
  • Issues (30d): 2
  • Star history: 220 stars in the last 90 days
