AKO4ALL by TongmingLAIC

Agentic kernel optimization for any hardware

Created 2 months ago
283 stars

Top 92.3% on SourcePulse

Summary

AKO4ALL automates GPU kernel optimization across diverse hardware, languages, and kernel types. Aimed at engineers and researchers, it speeds up development by iteratively profiling, editing, and benchmarking a kernel until it reaches expert-level performance, often surpassing established optimized libraries.

How It Works

The system runs an iterative agentic loop. The user drops a kernel into a working directory and invokes AKO4ALL through a coding agent. AKO4ALL then bootstraps a workspace, analyzes the kernel and its inputs, and refines the code through repeated rounds of profiling, benchmarking, and logging. When progress stalls, it can switch languages (e.g., Triton to CUDA) or search the web for new strategies. The loop continues until performance gains plateau.
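The loop described above can be illustrated with a minimal Python sketch. It is not AKO4ALL's actual implementation: the `optimize` function, the `propose` and `benchmark` callbacks, and the `min_gain` and `patience` thresholds are all hypothetical names chosen for illustration, and the "kernels" are plain strings with made-up latencies. It only shows the general shape of a propose, benchmark, keep-if-faster loop that stops once gains plateau.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    best_candidate: str
    best_latency_ms: float
    history: List[float] = field(default_factory=list)  # best latency after each round


def optimize(
    initial: str,
    propose: Callable[[str, int], str],  # hypothetical: returns an edited kernel
    benchmark: Callable[[str], float],   # hypothetical: returns latency in ms
    min_gain: float = 0.01,              # relative gain below which a round counts as stalled
    patience: int = 3,                   # consecutive stalled rounds tolerated before stopping
    max_rounds: int = 50,
) -> LoopResult:
    best, best_ms = initial, benchmark(initial)
    history = [best_ms]
    stalled = 0
    for round_idx in range(max_rounds):
        candidate = propose(best, round_idx)
        cand_ms = benchmark(candidate)
        gain = (best_ms - cand_ms) / best_ms
        if cand_ms < best_ms:
            # Always keep a faster candidate, even if the gain is tiny.
            best, best_ms = candidate, cand_ms
        # Only a meaningful gain resets the stall counter.
        stalled = 0 if gain > min_gain else stalled + 1
        history.append(best_ms)
        if stalled >= patience:
            break  # gains have plateaued
    return LoopResult(best, best_ms, history)


# Toy stand-ins for a real agent and a real benchmark (made-up numbers).
TOY_LATENCIES_MS = {"v0": 10.0, "v1": 8.0, "v2": 7.0, "v3": 6.98,
                    "v4": 7.5, "v5": 6.97, "v6": 5.0}

result = optimize(
    "v0",
    propose=lambda best, i: f"v{i + 1}",
    benchmark=TOY_LATENCIES_MS.__getitem__,
)
print(result.best_candidate, result.best_latency_ms)
```

In this toy run the loop stops after three consecutive rounds with less than 1% improvement, so the later candidate `v6` is never tried. A real system would presumably change strategy at that point (for example, switching languages or searching for new ideas) rather than simply stopping.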

Quick Start & Requirements

Install by cloning the repo into a coding agent's skills directory (e.g., ~/.claude/skills/ako4all) or by symlinking it there. Requirements: a coding agent (e.g., Claude Code), an NVIDIA GPU with CUDA, PyTorch (for the built-in evaluator), Python >= 3.10, and NVIDIA Nsight Compute (version-matched). Optimization typically completes in under an hour.
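As a rough pre-flight check for the requirements listed above, a short Python sketch like the following could be used. It is not part of AKO4ALL: `check_environment` is a hypothetical helper, the skill path is just the example location from the text, and the check assumes the standard `nvidia-smi` and `ncu` (Nsight Compute CLI) executables are on `PATH`. It does not verify that the Nsight Compute version matches anything.

```python
import importlib.util
import shutil
import sys
from pathlib import Path
from typing import Dict

# Example install location taken from the text above; adjust for other agents.
SKILL_DIR = Path.home() / ".claude" / "skills" / "ako4all"


def check_environment(skill_dir: Path = SKILL_DIR) -> Dict[str, bool]:
    """Report which prerequisites appear to be present on this machine."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "skill directory present": skill_dir.exists(),
        # nvidia-smi ships with the NVIDIA driver.
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        # ncu is the Nsight Compute command-line profiler.
        "ncu on PATH": shutil.which("ncu") is not None,
        # Checks only that PyTorch is installed, without importing it.
        "torch installed": importlib.util.find_spec("torch") is not None,
    }


for name, ok in check_environment().items():
    print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```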

Highlighted Details

  • Produces kernels that beat expert-written ones, often in under an hour; on NVIDIA B200 it swept FlashInfer's expert kernels.
  • Achieved geometric-mean (geomean) speedups over FlashInfer's expert baseline, including 1.36x for GQA paged decode and 1.50x for MLA paged prefill.
  • Functions
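For readers unfamiliar with the geomean figures quoted above: a geometric-mean speedup aggregates per-workload speedup ratios multiplicatively, so one outlier workload does not dominate the summary. The Python sketch below shows the calculation. The `geomean_speedup` helper and the latencies are made up for illustration; they are not AKO4ALL's evaluator or its measured data.

```python
import math
from typing import Sequence


def geomean_speedup(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-workload speedups (baseline latency / optimized latency)."""
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need two non-empty latency lists of equal length")
    # Average the logs of the ratios, then exponentiate.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))


# Illustrative latencies only: per-workload speedups are 2x, 1x, and 2x.
baseline = [2.0, 4.0, 8.0]
optimized = [1.0, 4.0, 4.0]
print(f"{geomean_speedup(baseline, optimized):.3f}x")  # cube root of 4, about 1.587x
```

For the same made-up data the arithmetic mean of the speedups would be about 1.667x, which is why benchmark summaries built from ratios usually report the geometric mean instead.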

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 0

Star History

128 stars in the last 30 days
