cfregly: AI Systems Performance Engineering for modern AI workloads
Top 62.8% on SourcePulse
Summary
This repository offers code, tooling, and resources for AI Systems Performance Engineering, focusing on optimizing GPU utilization, distributed training, and inference scaling. Aimed at AI/ML engineers and researchers, it provides a practical, profile-first methodology to build efficient, reliable AI pipelines, improving performance-per-watt and reducing costs.
How It Works
The project employs an empirical, hands-on approach centered on "goodput"-driven engineering. It guides users through diagnosing bottlenecks using profilers like Nsight and PyTorch's profiler. Core techniques involve optimizing memory bandwidth, leveraging compiler stacks (PyTorch, OpenAI Triton) for kernels, and implementing advanced parallelism strategies for training. For inference, it details methods for high-throughput serving using frameworks like vLLM, TensorRT-LLM, and NVIDIA Dynamo, including disaggregated prefill/decode and paged KV cache management.
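To make the paged KV cache idea mentioned above more concrete, here is a minimal sketch of the bookkeeping behind it: each sequence keeps a block table that maps its logical token positions onto fixed-size physical blocks drawn from a shared pool. This is an illustration only, not code from the repository or from vLLM; the class `PagedKVCache`, its fields, and the pool sizes are hypothetical, and no tensors are stored.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-table allocator (hypothetical); tracks slots, stores no tensors."""

    num_blocks: int   # physical blocks in the shared pool
    block_size: int   # tokens that fit in one block
    free_blocks: list = field(init=False)
    block_tables: dict = field(init=False, default_factory=dict)  # seq_id -> block ids
    seq_lens: dict = field(init=False, default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        # Reverse order so that pop() hands out block 0 first.
        self.free_blocks = list(range(self.num_blocks - 1, -1, -1))

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        needs_block = length % self.block_size == 0  # no block yet, or last one is full
        if needs_block and not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        table = self.block_tables.setdefault(seq_id, [])
        if needs_block:
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(reversed(self.block_tables.pop(seq_id, [])))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        print("seq a ->", cache.append_token("a"))
    print("seq b ->", cache.append_token("b"))
    cache.free("a")
    print("free blocks after releasing a:", len(cache.free_blocks))
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, memory is not reserved up front for each sequence's maximum length, which is the property that makes paging attractive for high-throughput serving.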
Quick Start & Requirements
code/ch1), and install dependencies via pip install -r requirements.txt. Examples run with python <script_name>.py.
Highlighted Details
Maintenance & Community
Associated with a large community via monthly AI Performance Engineering meetups (100k+ members) and a YouTube channel. Contributions are welcomed via CONTRIBUTING.md.
Licensing & Compatibility
Released under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.
Limitations & Caveats
Tooling and examples are heavily optimized for specific, cutting-edge NVIDIA hardware (Blackwell B200/B300) and software versions (CUDA 12.9, PyTorch 2.9 nightlies), potentially limiting applicability on older or different stacks. Example workloads are designed for quick profiling and may require adjustments for large-scale testing.
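Given the version sensitivity described above, it may help to check the local stack before running the examples. The sketch below is an assumption-laden illustration, not a script from the repository: `major_minor` and `describe_stack` are hypothetical helpers, and the reference versions are simply the ones named in the caveat. It relies only on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.get_device_name()`, and it degrades gracefully when PyTorch is not installed.

```python
import re

# Versions named in the caveat; used here only as reference points.
REFERENCE_TORCH = (2, 9)
REFERENCE_CUDA = (12, 9)


def major_minor(version):
    """Parse a version such as '2.9.0.dev20250801+cu129' into (2, 9); None if unparseable."""
    match = re.match(r"(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def describe_stack():
    """Report the local PyTorch/CUDA versions relative to the reference versions."""
    try:
        import torch
    except (ImportError, OSError):
        return {"torch_installed": False}

    torch_version = major_minor(torch.__version__)
    cuda_version = major_minor(torch.version.cuda)  # None on builds without CUDA
    return {
        "torch_installed": True,
        "torch": torch_version,
        "cuda": cuda_version,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "torch_older_than_reference": torch_version is not None and torch_version < REFERENCE_TORCH,
        "cuda_older_than_reference": cuda_version is not None and cuda_version < REFERENCE_CUDA,
    }


if __name__ == "__main__":
    for key, value in describe_stack().items():
        print(f"{key}: {value}")
```

A report showing older versions or no GPU does not mean the examples cannot run, only that results may differ from what the repository targets.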