autokernel by RightNow-AI

Autonomous GPU kernel optimization for PyTorch

Created 2 days ago


513 stars

Top 61.1% on SourcePulse

Project Summary

Summary

RightNow-AI/autokernel provides an autonomous agent for optimizing GPU kernels in PyTorch models. It targets engineers and researchers seeking maximum hardware performance by automatically identifying, optimizing, and verifying bottleneck kernels using Triton or CUDA C++. The core benefit is obtaining significantly faster, production-ready GPU kernels without manual optimization.

How It Works

The system employs an agent-driven, iterative refinement loop. It profiles PyTorch models to pinpoint bottlenecks, extracts them into standalone Triton or CUDA C++ kernels, and autonomously optimizes each via an edit-benchmark-keep/revert cycle on kernel.py. Orchestration uses Amdahl's Law to prioritize the optimizations that yield the greatest end-to-end speedup. All performance gains are validated against a rigorous 5-stage correctness harness (bench.py) before acceptance, ensuring functional integrity.
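The Amdahl's Law prioritization described above can be sketched as follows. This is a minimal illustration, not autokernel's actual code; the function names and the example profile numbers are assumptions:

```python
# Sketch of Amdahl's-Law prioritization: given each kernel's share of total
# runtime and an achievable per-kernel speedup, rank kernels by the
# end-to-end speedup they would unlock.

def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when `fraction` of total runtime
    is accelerated by `kernel_speedup`x."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def prioritize(kernels: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """`kernels` maps name -> (fraction_of_runtime, expected_speedup).
    Returns kernels sorted by the overall speedup they would yield."""
    ranked = [(name, end_to_end_speedup(f, s)) for name, (f, s) in kernels.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical profile: a big kernel with a modest speedup beats a small
# kernel with a huge speedup, because Amdahl's Law caps the latter's impact.
profile = {
    "softmax":   (0.40, 3.0),   # 40% of runtime, 3x achievable speedup
    "layernorm": (0.10, 10.0),  # 10% of runtime, 10x achievable speedup
}
print(prioritize(profile))  # softmax ranks first (~1.36x vs ~1.10x overall)
```

Note how the 10x layernorm kernel still yields only about a 1.10x end-to-end gain, which is why the orchestrator ranks by whole-application impact rather than raw kernel speedup.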

Quick Start & Requirements

Requires uv (installable via curl), Python 3.10+, and an NVIDIA GPU (tested on H100/A100/RTX 4090). Set up with uv run prepare.py, profile with uv run profile.py, extract the top kernels with uv run extract.py, and benchmark/verify with uv run bench.py. For autonomous optimization, point an external coding agent (e.g., Claude, Codex) at the instructions in program.md.

Highlighted Details

  • Dual Backend: Supports Triton for rapid iteration and CUDA C++ for maximum performance, including tensor core utilization.
  • Correctness First: Rigorous 5-stage correctness checks precede performance measurements.
  • Amdahl's Law Orchestration: Prioritizes optimizations based on potential overall application speedup.
  • Single File Modification: Agent focuses changes on kernel.py, simplifying review and rollback.
  • KernelBench Integration & HuggingFace Export: Uses KernelBench, the standard benchmark for AI-generated kernels, to drive iterative refinement, and supports exporting optimized kernels to the Hugging Face Hub.
  • TSV Logging: All experiment results logged to results.tsv for easy parsing.
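Because results.tsv is plain tab-separated text, experiment logs are easy to post-process. A minimal sketch of such parsing follows; the column names (kernel, backend, speedup, correct) are illustrative assumptions, not the tool's documented schema:

```python
import csv
import io

# Hypothetical sample of a results.tsv experiment log. Real column names
# and values may differ; this only demonstrates the TSV post-processing idea.
sample = (
    "kernel\tbackend\tspeedup\tcorrect\n"
    "softmax\ttriton\t2.10\ttrue\n"
    "matmul\tcuda\t1.35\ttrue\n"
    "softmax\ttriton\t1.80\tfalse\n"
)

def best_runs(tsv_text: str) -> dict[str, float]:
    """Return the highest verified speedup recorded per kernel."""
    best: dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["correct"] != "true":
            continue  # skip runs that failed the correctness harness
        speedup = float(row["speedup"])
        best[row["kernel"]] = max(best.get(row["kernel"], 0.0), speedup)
    return best

print(best_runs(sample))  # -> {'softmax': 2.1, 'matmul': 1.35}
```

In a real workflow you would read the file with `open("results.tsv")` instead of the inline string.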

Maintenance & Community

Inspired by Andrej Karpathy's autoresearch methodology. KernelBench integration builds upon work from Stanford's Scaling Intelligence Lab. No specific community channels or prominent maintainer/sponsor details are provided.

Licensing & Compatibility

Released under the MIT license, which is highly permissive and suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Operation is strictly limited to NVIDIA GPUs. Autonomous agent functionality requires integration with external coding agents to interpret program.md. Specific performance targets across all supported kernel types are not detailed in the README.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
2
Star History
524 stars in the last 2 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), Zhuohan Li (coauthor of vLLM), and 4 more.

mirage by mirage-project

0.3%
2k
Tool for fast GPU kernel generation via superoptimization
Created 1 year ago
Updated 1 day ago
Starred by David Cournapeau (author of scikit-learn), Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.4%
6k
Lecture series for GPU-accelerated computing
Created 2 years ago
Updated 1 month ago