ai-performance-engineering by cfregly

AI Systems Performance Engineering for modern AI workloads

Created 6 months ago
492 stars

Top 62.8% on SourcePulse

View on GitHub
Project Summary

This repository offers code, tooling, and resources for AI Systems Performance Engineering, focusing on optimizing GPU utilization, distributed training, and inference scaling. Aimed at AI/ML engineers and researchers, it provides a practical, profile-first methodology to build efficient, reliable AI pipelines, improving performance-per-watt and reducing costs.

How It Works

The project takes an empirical, hands-on approach centered on "goodput"-driven engineering. It guides users through diagnosing bottlenecks with profilers such as NVIDIA Nsight and the PyTorch profiler. Core techniques include optimizing memory bandwidth, leveraging compiler stacks (PyTorch's torch.compile, OpenAI Triton) to generate efficient kernels, and applying advanced parallelism strategies for training. For inference, it covers high-throughput serving with frameworks such as vLLM, TensorRT-LLM, and NVIDIA Dynamo, including disaggregated prefill/decode and paged KV-cache management.
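As an illustration of the "goodput" framing (a hypothetical sketch, not code from the repository), goodput counts only useful work delivered against total wall-clock time, including time lost to failures and restarts:

```python
# Minimal sketch of "goodput" accounting (hypothetical, not from the repo):
# only tokens from successful steps count as useful work, but the clock
# includes time lost to crashes, restarts, and stragglers.

def goodput(step_records, total_wall_clock_s):
    """step_records: list of (tokens_processed, succeeded) tuples."""
    useful_tokens = sum(tokens for tokens, ok in step_records if ok)
    return useful_tokens / total_wall_clock_s  # useful tokens per second

steps = [
    (4096, True),   # successful training step
    (4096, True),
    (4096, False),  # step lost to a crash/restart: tokens don't count
    (4096, True),
]
print(goodput(steps, total_wall_clock_s=60.0))  # 3 * 4096 / 60 = 204.8
```

Raw throughput would credit all four steps; goodput penalizes the failed one, which is why it is the metric the repo optimizes for.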

Quick Start & Requirements

  • Installation: Clone the repo, navigate to a chapter's code directory (e.g., code/ch1), and install dependencies via pip install -r requirements.txt. Examples run with python <script_name>.py.
  • Prerequisites: NVIDIA GPU with CUDA, Python 3.8+, PyTorch with CUDA. Docker optional.
  • Target Stack: Optimized for NVIDIA Blackwell B200/B300, assuming CUDA 12.9, PyTorch 2.9 nightlies, and Triton 3.5.0. Helper scripts are provided.
  • Links: O'Reilly book, AI Performance Engineering Meetup, YouTube Channel.

Highlighted Details

  • Features a 175+ item performance checklist covering the AI lifecycle.
  • Includes thousands of lines of PyTorch and CUDA C++ code examples for NVIDIA GPUs.
  • Emphasizes optimizing for cost-per-token and performance-per-watt.
  • Offers a Unified Profiling Harness for Nsight and PyTorch profilers.
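To make the cost-per-token metric concrete, a back-of-the-envelope calculation (hypothetical numbers, not taken from the repo):

```python
# Illustrative cost-per-token calculation (hypothetical numbers, not from the repo).
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_s):
    cluster_hourly_usd = gpu_hourly_usd * num_gpus
    tokens_per_hour = tokens_per_s * 3600
    return cluster_hourly_usd / tokens_per_hour * 1_000_000  # USD per 1M tokens

# e.g. 8 GPUs at $4/hr serving 20,000 tokens/s in aggregate:
print(round(cost_per_million_tokens(4.0, 8, 20_000), 4))  # 0.4444
```

Any optimization that raises aggregate tokens/s on the same hardware (or holds throughput on cheaper hardware) drives this number down, which is the lens the checklist applies across the stack.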

Maintenance & Community

The project is backed by a large community: monthly AI Performance Engineering meetups (100k+ members) and a YouTube channel. Contributions are welcomed via CONTRIBUTING.md.

Licensing & Compatibility

Released under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

Tooling and examples are heavily optimized for specific, cutting-edge NVIDIA hardware (Blackwell B200/B300) and software versions (CUDA 12.9, PyTorch 2.9 nightlies), potentially limiting applicability on older or different stacks. Example workloads are designed for quick profiling and may require adjustments for large-scale testing.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 360 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

  • 1.2%
  • 4k stars
  • AI inference pipeline framework
  • Created 1 year ago
  • Updated 12 hours ago

Starred by Jeff Hammerbacher (cofounder of Cloudera), Yineng Zhang (inference lead at SGLang; research scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

  • 0.4%
  • 12k stars
  • LLM inference optimization SDK for NVIDIA GPUs
  • Created 2 years ago
  • Updated 9 hours ago