autokernel by RightNow-AI

Autonomous GPU kernel optimization for PyTorch

Created 2 days ago


513 stars

Top 61.1% on SourcePulse

Project Summary

Summary

RightNow-AI/autokernel provides an autonomous agent for optimizing GPU kernels in PyTorch models. It targets engineers and researchers seeking maximum hardware performance by automatically identifying, optimizing, and verifying bottleneck kernels using Triton or CUDA C++. The core benefit is obtaining significantly faster, production-ready GPU kernels without manual optimization.

How It Works

The system employs an agent-driven, iterative refinement loop. It profiles PyTorch models to pinpoint bottlenecks, extracts them into standalone Triton or CUDA C++ kernels, and autonomously optimizes each via an edit-benchmark-keep/revert cycle on kernel.py. Orchestration uses Amdahl's Law to prioritize the optimizations that yield the greatest end-to-end speedup. All performance gains are validated against a rigorous 5-stage correctness harness (bench.py) before acceptance, ensuring functional integrity.
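The Amdahl's Law prioritization described above can be sketched as follows. This is a minimal illustration, not autokernel's actual code; the function names and the example profile numbers are assumptions:

```python
# Sketch of Amdahl's-Law prioritization: given each kernel's share of total
# runtime and an achievable per-kernel speedup, rank kernels by the
# end-to-end speedup they would unlock.

def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when `fraction` of total runtime
    is accelerated by `kernel_speedup`x."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def prioritize(kernels: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """`kernels` maps name -> (fraction_of_runtime, expected_speedup).
    Returns kernels sorted by the overall speedup they would yield."""
    ranked = [(name, end_to_end_speedup(f, s)) for name, (f, s) in kernels.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical profile: a big kernel with a modest speedup beats a small
# kernel with a huge speedup, because Amdahl's Law caps the latter's impact.
profile = {
    "softmax":   (0.40, 3.0),   # 40% of runtime, 3x achievable speedup
    "layernorm": (0.10, 10.0),  # 10% of runtime, 10x achievable speedup
}
print(prioritize(profile))  # softmax ranks first (~1.36x vs ~1.10x overall)
```

Note how the 10x layernorm kernel still yields only about a 1.10x end-to-end gain, which is why the orchestrator ranks by whole-application impact rather than raw kernel speedup.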

Quick Start & Requirements

Requires uv (installable via curl), Python 3.10+, and an NVIDIA GPU (tested on H100/A100/RTX 4090). Set up with uv run prepare.py, profile with uv run profile.py, extract the top kernels with uv run extract.py, and benchmark/verify with uv run bench.py. For autonomous optimization, point an external coding agent (e.g., Claude, Codex) at the instructions in program.md.

Highlighted Details

  • Dual Backend: Supports Triton for rapid iteration and CUDA C++ for maximum performance, including tensor core utilization.
  • Correctness First: Rigorous 5-stage correctness checks precede performance measurements.
  • Amdahl's Law Orchestration: Prioritizes optimizations based on potential overall application speedup.
  • Single File Modification: Agent focuses changes on kernel.py, simplifying review and rollback.
  • KernelBench Integration & HuggingFace Export: Uses KernelBench, the standard benchmark for AI-generated kernels, to drive iterative refinement, and supports exporting optimized kernels to the Hugging Face Hub.
  • TSV Logging: All experiment results logged to results.tsv for easy parsing.
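Because results.tsv is plain tab-separated text, experiment logs are easy to post-process. A minimal sketch of such parsing follows; the column names (kernel, backend, speedup, correct) are illustrative assumptions, not the tool's documented schema:

```python
import csv
import io

# Hypothetical sample of a results.tsv experiment log. Real column names
# and values may differ; this only demonstrates the TSV post-processing idea.
sample = (
    "kernel\tbackend\tspeedup\tcorrect\n"
    "softmax\ttriton\t2.10\ttrue\n"
    "matmul\tcuda\t1.35\ttrue\n"
    "softmax\ttriton\t1.80\tfalse\n"
)

def best_runs(tsv_text: str) -> dict[str, float]:
    """Return the highest verified speedup recorded per kernel."""
    best: dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["correct"] != "true":
            continue  # skip runs that failed the correctness harness
        speedup = float(row["speedup"])
        best[row["kernel"]] = max(best.get(row["kernel"], 0.0), speedup)
    return best

print(best_runs(sample))  # -> {'softmax': 2.1, 'matmul': 1.35}
```

In a real workflow you would read the file with `open("results.tsv")` instead of the inline string.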

Maintenance & Community

Inspired by Andrej Karpathy's autoresearch methodology. KernelBench integration builds upon work from Stanford's Scaling Intelligence Lab. No specific community channels or prominent maintainer/sponsor details are provided.

Licensing & Compatibility

Released under the MIT license, which is highly permissive and suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Operation is strictly limited to NVIDIA GPUs. Autonomous agent functionality requires integration with external coding agents to interpret program.md. Specific performance targets across all supported kernel types are not detailed in the README.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
2
Star History
524 stars in the last 2 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), Zhuohan Li (coauthor of vLLM), and 4 more.

mirage by mirage-project

0.3%
2k
Tool for fast GPU kernel generation via superoptimization
Created 1 year ago
Updated 1 day ago
Starred by David Cournapeau (author of scikit-learn), Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.4%
6k
Lecture series for GPU-accelerated computing
Created 2 years ago
Updated 1 month ago