torch-profiling-tutorial by Quentin-Anthony

PyTorch model profiling tutorial

Created 1 month ago
476 stars

Top 64.1% on SourcePulse

Project Summary

This repository provides a tutorial on profiling PyTorch models, focusing on identifying performance bottlenecks and improving GPU efficiency. It is targeted at researchers and engineers working with large language models or other deep learning architectures who need to optimize training loops. The tutorial demonstrates how to use the PyTorch profiler and interpret its output, leading to actionable optimization strategies.

How It Works

The tutorial guides users through a standard PyTorch training loop for a transformer model. It leverages the built-in PyTorch profiler to capture detailed performance metrics, including CPU and GPU execution times, memory usage, and GPU utilization. The approach emphasizes a step-by-step analysis, starting with a high-level overview from the PyTorch profiler and then delving into lower-level GPU metrics like SM efficiency and achieved occupancy to pinpoint inefficiencies.
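As a minimal sketch of this approach, the built-in profiler can be wrapped around any workload. The example below is CPU-only so it runs anywhere (the tutorial itself profiles a transformer training loop on GPU), and the `profile_matmul` helper name is hypothetical, not part of the repository:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

def profile_matmul(size: int = 256) -> str:
    """Profile a small matrix multiply and return the operator summary table."""
    x = torch.randn(size, size)
    w = torch.randn(size, size)
    # ProfilerActivity.CUDA can be added when a GPU is available,
    # which is what the tutorial does for its transformer model.
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        with record_function("matmul_step"):  # shows up as a labeled region in the trace
            _ = x @ w
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)

if __name__ == "__main__":
    print(profile_matmul())
```

The printed table is the "high-level overview" the tutorial starts from: per-operator CPU (and, on GPU, CUDA) time and memory, sorted by total cost.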

Quick Start & Requirements

  • Installation: Clone the repository and use the provided Docker command for an NVIDIA environment:
    git clone https://github.com/Quentin-Anthony/torch-profiler-tutorial.git
    docker run --privileged --shm-size=1000gb --gpus all -it --rm -v ~/torch-profiler-tutorial:/torch-profiler-tutorial nvcr.io/nvidia/pytorch:23.10-py3
    
  • Prerequisites: NVIDIA GPU, Docker, NVIDIA Container Toolkit.
  • Running the tutorial: Execute python torch_prof.py within the container.
  • Viewing traces: Copy the ./log directory locally and run tensorboard --logdir=./log.
  • Resources: Requires a GPU-enabled system and sufficient disk space for Docker images and logs.

Highlighted Details

  • Demonstrates how to interpret PyTorch profiler output, including execution summaries and GPU utilization metrics.
  • Explains nuanced GPU efficiency metrics like Est. SM Efficiency and Est. Achieved Occupancy.
  • Provides a practical example of optimizing a transformer model by enabling FlashAttention, increasing batch size/sequence length, and using FP16 precision with torch.cuda.amp.autocast.
  • Shows how to link CPU operations to GPU kernels using trace viewers.
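The optimization combination highlighted above (fused attention plus lower precision) can be sketched as follows. This is an illustrative sketch, not the repository's torch_prof.py: it uses torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs, and falls back to BF16 autocast on CPU so it runs without a GPU:

```python
import torch
import torch.nn.functional as F

def fused_attention(batch: int = 2, heads: int = 4,
                    seq: int = 128, head_dim: int = 64) -> torch.Tensor:
    """Run scaled-dot-product attention under autocast mixed precision."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # FP16 on GPU (matching the tutorial's torch.cuda.amp.autocast usage),
    # BF16 on CPU so the sketch still runs on a machine without a GPU.
    dtype = torch.float16 if device == "cuda" else torch.bfloat16
    q = torch.randn(batch, heads, seq, head_dim, device=device)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    with torch.autocast(device_type=device, dtype=dtype):
        # On CUDA this may select a FlashAttention backend;
        # on CPU it falls back to the math/fused CPU implementation.
        out = F.scaled_dot_product_attention(q, k, v)
    return out

if __name__ == "__main__":
    print(fused_attention().shape)
```

Larger batch sizes and sequence lengths, as the tutorial notes, keep the GPU's SMs busier and raise achieved occupancy, which is why they pair naturally with the kernel and precision changes.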

Maintenance & Community

The repository is maintained by Quentin Anthony. There are no explicit community channels or roadmap links provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README.

Limitations & Caveats

The tutorial mentions that the TensorBoard trace viewer can be RAM-intensive for large traces, suggesting alternatives like ui.perfetto.dev. It also notes that low-level GPU profilers (NVIDIA NSYS, AMD Rocprof) are marked as "TODO" and are not yet included. The initial setup involves a privileged Docker container, which might be a security consideration for some environments.
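The ui.perfetto.dev workaround amounts to exporting the profiler's trace in Chrome trace (JSON) format and loading that file in the browser instead of TensorBoard. A minimal sketch, where the `export_trace` helper name is hypothetical rather than from the repo:

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

def export_trace(path: str) -> str:
    """Capture a short CPU trace and write it in Chrome trace (JSON) format."""
    x = torch.randn(128, 128)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(3):
            x = x @ x
    # The resulting JSON can be opened at ui.perfetto.dev (or chrome://tracing),
    # which handles large traces with far less RAM than TensorBoard's viewer.
    prof.export_chrome_trace(path)
    return path

if __name__ == "__main__":
    print(export_trace(os.path.join(tempfile.gettempdir(), "trace.json")))
```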

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (Author of SGLang), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence

538 stars · Top 1.7% on SourcePulse
Benchmark for LLMs generating GPU kernels from PyTorch ops
Created 10 months ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Author of SGLang).

fastllm by ztxz16

4k stars · Top 0.3% on SourcePulse
High-performance C++ LLM inference library
Created 2 years ago · Updated 9 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

5k stars · Top 1.0% on SourcePulse
Lecture series for GPU-accelerated computing
Created 1 year ago · Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 33 more.

flash-attention by Dao-AILab

19k stars · Top 0.9% on SourcePulse
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 2 days ago