lightning-thunder by Lightning-AI

PyTorch compiler for model optimization via source-to-source transformation

created 1 year ago
1,384 stars

Top 29.8% on sourcepulse

Project Summary

Thunder is a source-to-source compiler for PyTorch, designed to optimize model performance, memory usage, and parallelism. It targets both end-users seeking out-of-the-box speed-ups and performance experts needing an extensible framework for custom optimizations like kernel fusion, quantization, and distributed strategies.

How It Works

Thunder operates in three stages. First, it acquires the model by interpreting the program's Python bytecode and recording a straight-line Python program, the computation trace. Next, it transforms that trace to incorporate optimizations such as distributed strategies or precision changes. Finally, it routes parts of the trace to execution backends, including NVFuser, torch.compile, specialized libraries (cuDNN SDPA, TransformerEngine), and custom Triton/CUDA kernels. Because every stage operates on the same trace representation, transformations compose cleanly and optimizations can be swapped in and out.
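For orientation, here is a minimal sketch of the acquire-and-inspect flow, assuming the thunder.jit and thunder.last_traces entry points described in the project's documentation (the exact trace printout varies by version):

    import torch
    import thunder

    def f(x, y):
        return torch.nn.functional.relu(x @ y)

    jf = thunder.jit(f)  # stage 1: acquisition happens lazily, on first call
    out = jf(torch.randn(8, 8), torch.randn(8, 8))  # first call records the trace

    # Thunder keeps the history of transformed traces; the last entry is the
    # straight-line Python program that was actually executed.
    print(thunder.last_traces(jf)[-1])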

Quick Start & Requirements

  • Install via pip: pip install torch==2.6.0 torchvision==0.21 nvfuser-cu124-torch26, then pip install lightning-thunder.
  • Blackwell support requires CUDA 12.8, along with nightly PyTorch and nvfuser builds.
  • Optional executors such as cuDNN SDPA and TransformerEngine (for Float8) can be installed separately.
  • See Examples for usage with LitGPT and Hugging Face models; a minimal smoke test is sketched after this list.
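After installing, a quick way to verify that compilation works is an eager-vs-compiled comparison. This is a hedged sketch: the toy Linear model is ours, and on a CPU-only machine Thunder is expected to fall back to its plain torch executor rather than NVFuser:

    import torch
    import thunder

    model = torch.nn.Linear(64, 64)
    x = torch.randn(4, 64)

    jitted = thunder.jit(model)              # compile the module
    y = jitted(x)                            # first call traces and compiles
    torch.testing.assert_close(y, model(x))  # results should match eager PyTorch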

Highlighted Details

  • Claims up to 40% speed-up for PyTorch models.
  • Supports quantization, FP4/FP6/FP8 precision, kernel fusion, and various distributed training strategies (DDP, FSDP, TP).
  • Ready for NVIDIA Blackwell hardware and supports CUDA Graphs.
  • Integrates custom Triton kernels and offers plugins for swapping optimizations in and out; a sketch of the distributed path follows this list.
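To illustrate how the distributed strategies compose with compilation, here is a hedged sketch assuming the fsdp transform in thunder.distributed and a torchrun launch (the three-layer model is a placeholder of our own):

    import torch
    import torch.distributed as dist
    import thunder
    from thunder.distributed import fsdp

    dist.init_process_group(backend="nccl")  # torchrun provides rank/world size
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    torch.cuda.set_device(device)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
    ).to(device)

    model = fsdp(model)          # shard parameters across ranks
    jitted = thunder.jit(model)  # the compiled trace now carries the sharding
    out = jitted(torch.randn(8, 1024, device=device))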

Maintenance & Community

  • Developed in collaboration with the community, with significant contributions from NVIDIA.
  • Active development indicated by CI badges.
  • Community support available via Discord.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Requires specific PyTorch and CUDA versions for optimal functionality, with advanced features like Blackwell support needing nightly builds.
  • Performance gains are hardware- and model-dependent; the project itself notes that specific optimizations "may or may not make a big difference" depending on the workload.

Health Check

  • Last commit: 18 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 99
  • Issues (30d): 23

Star History

  • 58 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0%
402
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 7 more.

ThunderKittens by HazyResearch

0.6%
3k
CUDA kernel framework for fast deep learning primitives
created 1 year ago
updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6%
5k
Triton kernels for efficient LLM training
created 1 year ago
updated 1 day ago