torch-profiling-tutorial by Quentin-Anthony

PyTorch model profiling tutorial

Created 1 month ago
476 stars

Top 64.1% on SourcePulse

Project Summary

This repository provides a tutorial on profiling PyTorch models, focusing on identifying performance bottlenecks and improving GPU efficiency. It is targeted at researchers and engineers working with large language models or other deep learning architectures who need to optimize training loops. The tutorial demonstrates how to use the PyTorch profiler and interpret its output, leading to actionable optimization strategies.

How It Works

The tutorial guides users through a standard PyTorch training loop for a transformer model. It leverages the built-in PyTorch profiler to capture detailed performance metrics, including CPU and GPU execution times, memory usage, and GPU utilization. The approach emphasizes a step-by-step analysis, starting with a high-level overview from the PyTorch profiler and then delving into lower-level GPU metrics like SM efficiency and achieved occupancy to pinpoint inefficiencies.
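As a minimal sketch of this approach, the built-in profiler can be wrapped around any workload. The example below is CPU-only so it runs anywhere (the tutorial itself profiles a transformer training loop on GPU), and the `profile_matmul` helper name is hypothetical, not part of the repository:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

def profile_matmul(size: int = 256) -> str:
    """Profile a small matrix multiply and return the operator summary table."""
    x = torch.randn(size, size)
    w = torch.randn(size, size)
    # ProfilerActivity.CUDA can be added when a GPU is available,
    # which is what the tutorial does for its transformer model.
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        with record_function("matmul_step"):  # shows up as a labeled region in the trace
            _ = x @ w
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)

if __name__ == "__main__":
    print(profile_matmul())
```

The printed table is the "high-level overview" the tutorial starts from: per-operator CPU (and, on GPU, CUDA) time and memory, sorted by total cost.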

Quick Start & Requirements

  • Installation: Clone the repository and use the provided Docker command for an NVIDIA environment:
    git clone https://github.com/Quentin-Anthony/torch-profiler-tutorial.git
    docker run --privileged --shm-size=1000gb --gpus all -it --rm -v ~/torch-profiler-tutorial:/torch-profiler-tutorial nvcr.io/nvidia/pytorch:23.10-py3
    
  • Prerequisites: NVIDIA GPU, Docker, NVIDIA Container Toolkit.
  • Running the tutorial: Execute python torch_prof.py within the container.
  • Viewing traces: Copy the ./log directory locally and run tensorboard --logdir=./log.
  • Resources: Requires a GPU-enabled system and sufficient disk space for Docker images and logs.

Highlighted Details

  • Demonstrates how to interpret PyTorch profiler output, including execution summaries and GPU utilization metrics.
  • Explains nuanced GPU efficiency metrics like Est. SM Efficiency and Est. Achieved Occupancy.
  • Provides a practical example of optimizing a transformer model by enabling FlashAttention, increasing batch size/sequence length, and using FP16 precision with torch.cuda.amp.autocast.
  • Shows how to link CPU operations to GPU kernels using trace viewers.
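The optimization combination highlighted above (fused attention plus lower precision) can be sketched as follows. This is an illustrative sketch, not the repository's torch_prof.py: it uses torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs, and falls back to BF16 autocast on CPU so it runs without a GPU:

```python
import torch
import torch.nn.functional as F

def fused_attention(batch: int = 2, heads: int = 4,
                    seq: int = 128, head_dim: int = 64) -> torch.Tensor:
    """Run scaled-dot-product attention under autocast mixed precision."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # FP16 on GPU (matching the tutorial's torch.cuda.amp.autocast usage),
    # BF16 on CPU so the sketch still runs on a machine without a GPU.
    dtype = torch.float16 if device == "cuda" else torch.bfloat16
    q = torch.randn(batch, heads, seq, head_dim, device=device)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    with torch.autocast(device_type=device, dtype=dtype):
        # On CUDA this may select a FlashAttention backend;
        # on CPU it falls back to the math/fused CPU implementation.
        out = F.scaled_dot_product_attention(q, k, v)
    return out

if __name__ == "__main__":
    print(fused_attention().shape)
```

Larger batch sizes and sequence lengths, as the tutorial notes, keep the GPU's SMs busier and raise achieved occupancy, which is why they pair naturally with the kernel and precision changes.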

Maintenance & Community

The repository is maintained by Quentin Anthony. There are no explicit community channels or roadmap links provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README.

Limitations & Caveats

The tutorial mentions that the TensorBoard trace viewer can be RAM-intensive for large traces, suggesting alternatives like ui.perfetto.dev. It also notes that low-level GPU profilers (NVIDIA NSYS, AMD Rocprof) are marked as "TODO" and are not yet included. The initial setup involves a privileged Docker container, which might be a security consideration for some environments.
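The ui.perfetto.dev workaround amounts to exporting the profiler's trace in Chrome trace (JSON) format and loading that file in the browser instead of TensorBoard. A minimal sketch, where the `export_trace` helper name is hypothetical rather than from the repo:

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

def export_trace(path: str) -> str:
    """Capture a short CPU trace and write it in Chrome trace (JSON) format."""
    x = torch.randn(128, 128)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(3):
            x = x @ x
    # The resulting JSON can be opened at ui.perfetto.dev (or chrome://tracing),
    # which handles large traces with far less RAM than TensorBoard's viewer.
    prof.export_chrome_trace(path)
    return path

if __name__ == "__main__":
    print(export_trace(os.path.join(tempfile.gettempdir(), "trace.json")))
```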

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (Author of SGLang), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence

538 stars · Top 1.7% on SourcePulse
Benchmark for LLMs generating GPU kernels from PyTorch ops
Created 10 months ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Author of SGLang).

fastllm by ztxz16

4k stars · Top 0.3% on SourcePulse
High-performance C++ LLM inference library
Created 2 years ago · Updated 9 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

5k stars · Top 1.0% on SourcePulse
Lecture series for GPU-accelerated computing
Created 1 year ago · Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 33 more.

flash-attention by Dao-AILab

19k stars · Top 0.9% on SourcePulse
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 2 days ago