ai-performance-engineering by cfregly

AI Systems Performance Engineering for modern AI workloads

Created 6 months ago
492 stars

Top 62.8% on SourcePulse

View on GitHub
Project Summary

This repository offers code, tooling, and resources for AI Systems Performance Engineering, focusing on optimizing GPU utilization, distributed training, and inference scaling. Aimed at AI/ML engineers and researchers, it provides a practical, profile-first methodology to build efficient, reliable AI pipelines, improving performance-per-watt and reducing costs.

How It Works

The project takes an empirical, hands-on approach centered on "goodput"-driven engineering. It guides users through diagnosing bottlenecks with profilers such as NVIDIA Nsight and the PyTorch profiler. Core techniques include optimizing memory bandwidth, leveraging compiler stacks (PyTorch's torch.compile, OpenAI Triton) to generate efficient kernels, and applying advanced parallelism strategies for training. For inference, it covers high-throughput serving with frameworks such as vLLM, TensorRT-LLM, and NVIDIA Dynamo, including disaggregated prefill/decode and paged KV-cache management.
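As an illustration of the "goodput" framing (a hypothetical sketch, not code from the repository), goodput counts only useful work delivered against total wall-clock time, including time lost to failures and restarts:

```python
# Minimal sketch of "goodput" accounting (hypothetical, not from the repo):
# only tokens from successful steps count as useful work, but the clock
# includes time lost to crashes, restarts, and stragglers.

def goodput(step_records, total_wall_clock_s):
    """step_records: list of (tokens_processed, succeeded) tuples."""
    useful_tokens = sum(tokens for tokens, ok in step_records if ok)
    return useful_tokens / total_wall_clock_s  # useful tokens per second

steps = [
    (4096, True),   # successful training step
    (4096, True),
    (4096, False),  # step lost to a crash/restart: tokens don't count
    (4096, True),
]
print(goodput(steps, total_wall_clock_s=60.0))  # 3 * 4096 / 60 = 204.8
```

Raw throughput would credit all four steps; goodput penalizes the failed one, which is why it is the metric the repo optimizes for.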

Quick Start & Requirements

  • Installation: Clone the repo, navigate to a chapter's code directory (e.g., code/ch1), and install dependencies via pip install -r requirements.txt. Examples run with python <script_name>.py.
  • Prerequisites: NVIDIA GPU with CUDA, Python 3.8+, PyTorch with CUDA. Docker optional.
  • Target Stack: Optimized for NVIDIA Blackwell B200/B300, assuming CUDA 12.9, PyTorch 2.9 nightlies, and Triton 3.5.0. Helper scripts are provided.
  • Links: O'Reilly book, AI Performance Engineering Meetup, YouTube Channel.

Highlighted Details

  • Features a 175+ item performance checklist covering the AI lifecycle.
  • Includes thousands of lines of PyTorch and CUDA C++ code examples for NVIDIA GPUs.
  • Emphasizes optimizing for cost-per-token and performance-per-watt.
  • Offers a Unified Profiling Harness for Nsight and PyTorch profilers.
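To make the cost-per-token metric concrete, a back-of-the-envelope calculation (hypothetical numbers, not taken from the repo):

```python
# Illustrative cost-per-token calculation (hypothetical numbers, not from the repo).
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_s):
    cluster_hourly_usd = gpu_hourly_usd * num_gpus
    tokens_per_hour = tokens_per_s * 3600
    return cluster_hourly_usd / tokens_per_hour * 1_000_000  # USD per 1M tokens

# e.g. 8 GPUs at $4/hr serving 20,000 tokens/s in aggregate:
print(round(cost_per_million_tokens(4.0, 8, 20_000), 4))  # 0.4444
```

Any optimization that raises aggregate tokens/s on the same hardware (or holds throughput on cheaper hardware) drives this number down, which is the lens the checklist applies across the stack.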

Maintenance & Community

The project is backed by a large community: monthly AI Performance Engineering meetups (100k+ members) and a YouTube channel. Contributions are welcomed via CONTRIBUTING.md.

Licensing & Compatibility

Released under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

Tooling and examples are heavily optimized for specific, cutting-edge NVIDIA hardware (Blackwell B200/B300) and software versions (CUDA 12.9, PyTorch 2.9 nightlies), potentially limiting applicability on older or different stacks. Example workloads are designed for quick profiling and may require adjustments for large-scale testing.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 360 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

  • 1.2%
  • 4k stars
  • AI inference pipeline framework
  • Created 1 year ago
  • Updated 12 hours ago

Starred by Jeff Hammerbacher (cofounder of Cloudera), Yineng Zhang (inference lead at SGLang; research scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

  • 0.4%
  • 12k stars
  • LLM inference optimization SDK for NVIDIA GPUs
  • Created 2 years ago
  • Updated 9 hours ago