gpu-perf-engineering-resources by wafer-ai

GPU performance engineering curriculum for AI infrastructure

Created 1 month ago
401 stars

Top 72.3% on SourcePulse

Project Summary

This repository offers a comprehensive, tiered curriculum for engineers focused on GPU performance engineering for high-performance AI systems. It guides learners from fundamental GPU programming to cutting-edge techniques used in frontier AI labs, enabling effective optimization of AI infrastructure.

How It Works

The curriculum is structured into sequential tiers, covering GPU architecture, low-level programming (PTX, SASS), optimization for core operations (matmul, attention), and modern AI inference systems. It emphasizes foundational knowledge, practical insights from practitioner blogs, and official documentation, balancing fundamental concepts with advanced techniques.

Quick Start & Requirements

This is a learning curriculum, not a software project. It outlines a recommended reading order. Applying the learned concepts requires access to GPUs (NVIDIA, AMD), CUDA/ROCm toolkits, and potentially specific hardware architectures for advanced topics.

Highlighted Details

  • In-depth coverage of AI acceleration: FlashAttention (v1-v3), PagedAttention, KV cache optimization.
  • Exploration of kernel DSLs and template libraries: OpenAI's Triton, NVIDIA's CUTLASS, and Modular's Mojo.
  • Profiling and optimization using NVIDIA tools (Nsight Compute) and the Roofline model.
  • Resources for alternative hardware: AMD GPUs (ROCm) and Google TPUs.
  • Production inference systems: continuous batching, speculative decoding, LLM-generated kernels.
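As a taste of the profiling material, the Roofline model mentioned above fits in a few lines: a kernel's attainable throughput is the minimum of the hardware's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte of DRAM traffic). The sketch below uses illustrative A100-like numbers as assumptions; they are not taken from the repository.

```python
# Minimal Roofline model sketch (hardware numbers are illustrative assumptions).
def roofline(peak_flops: float, mem_bw: float, intensity: float) -> float:
    """Attainable FLOP/s for a kernel with the given arithmetic intensity
    (FLOPs per byte of memory traffic)."""
    return min(peak_flops, mem_bw * intensity)

# Assumed A100-like figures: ~19.5 TFLOP/s FP32 peak, ~1555 GB/s HBM bandwidth.
PEAK = 19.5e12   # FLOP/s
BW = 1555e9      # bytes/s

# FP32 vector add: 1 FLOP per 12 bytes moved (read x, read y, write z)
# -> low intensity, so the kernel sits on the memory-bandwidth roof.
vec_add = roofline(PEAK, BW, 1 / 12)

# Large FP32 matmul: intensity grows with problem size; at ~100 FLOPs/byte
# the kernel reaches the compute roof.
matmul = roofline(PEAK, BW, 100.0)

print(f"vector add: {vec_add / 1e12:.2f} TFLOP/s")  # far below peak
print(f"matmul:     {matmul / 1e12:.2f} TFLOP/s")   # hits the compute roof
```

Plotting attainable FLOP/s against intensity on log-log axes gives the familiar roofline chart; tools like Nsight Compute report where a measured kernel lands relative to those two roofs.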

Maintenance & Community

Contributions prioritize primary sources and practitioner insights. The project fosters a large community via its active Discord server (23k+ members) and curated learning materials.

Licensing & Compatibility

The MIT license is permissive, allowing broad adoption and integration of learned principles in commercial and closed-source contexts.

Limitations & Caveats

As a curriculum, it offers no direct code execution or hands-on labs; it provides knowledge pointers, and users must set up their own environments. While it covers AMD GPUs and TPUs, its primary focus and depth of detail are on NVIDIA hardware and CUDA.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
108 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.1% on SourcePulse
4k stars
High-performance C++ LLM inference library
Created 2 years ago
Updated 17 hours ago
Starred by David Cournapeau (author of scikit-learn), Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake), and 5 more.

lectures by gpu-mode

Top 0.5% on SourcePulse
6k stars
Lecture series for GPU-accelerated computing
Created 2 years ago
Updated 3 weeks ago