lightning-thunder by Lightning-AI

PyTorch compiler for model optimization via source-to-source transformation

Created 1 year ago
1,411 stars

Top 28.8% on SourcePulse

View on GitHub
Project Summary

Thunder is a source-to-source compiler for PyTorch, designed to optimize model performance, memory usage, and parallelism. It targets both end-users seeking out-of-the-box speed-ups and performance experts needing an extensible framework for custom optimizations like kernel fusion, quantization, and distributed strategies.
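
For orientation, here is a minimal sketch of the typical entry point, thunder.jit, applied to an ordinary PyTorch module (the module and layer sizes below are arbitrary, chosen for illustration):

    import torch
    import torch.nn as nn
    import thunder

    # An ordinary PyTorch module; the layer sizes are arbitrary.
    model = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 64))

    # thunder.jit wraps the module; the result is called like the original.
    thunder_model = thunder.jit(model)

    x = torch.randn(64, 2048)
    y = thunder_model(x)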

How It Works

Thunder operates in three stages: it first acquires the PyTorch model by interpreting Python bytecode into a straight-line Python program. Next, it transforms this computation trace to incorporate optimizations such as distributed strategies or precision changes. Finally, it routes parts of the trace for execution using various backends, including NVFuser, torch.compile, specialized libraries (cuDNN SDPA, TransformerEngine), and custom Triton/CUDA kernels. This approach allows for composable transformations and easy swapping of optimizations.
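
To make the three stages concrete, a hedged sketch of inspecting the computation trace; thunder.last_traces is the trace-inspection helper in recent releases, though the exact API may differ by version:

    import torch
    import thunder

    def fn(x):
        return torch.nn.functional.gelu(x) + x

    jfn = thunder.jit(fn)
    jfn(torch.randn(16))  # the first call acquires and transforms the trace

    # Inspect the chain of traces produced for the last call, from the
    # acquired program to the final trace handed to the executors.
    traces = thunder.last_traces(jfn)
    print(traces[-1])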

Quick Start & Requirements

  • Install via pip: pip install torch==2.6.0 torchvision==0.21 nvfuser-cu124-torch26 followed by pip install lightning-thunder.
  • For Blackwell support, CUDA 12.8 is required, along with nightly PyTorch and nvfuser builds.
  • Optional executors like cuDNN SDPA and TransformerEngine (for Float8) can be installed separately; executor selection is sketched after this list.
  • See Examples for usage with LitGPT and Hugging Face models.
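
For the optional executors above, selection happens at compile time. A sketch, assuming the executors keyword of thunder.jit accepts executor names; whether strings or executor objects are expected may vary by version:

    import torch
    import thunder

    model = torch.nn.Linear(1024, 1024)

    # Executor names here ("cudnn", "nvfuser") are illustrative; which ones
    # are available depends on the optional packages installed above.
    thunder_model = thunder.jit(model, executors=["cudnn", "nvfuser", "torch"])

    y = thunder_model(torch.randn(8, 1024))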

Highlighted Details

  • Claims up to 40% speed-up for PyTorch models.
  • Supports quantization, FP4/FP6/FP8 precision, kernel fusion, and various distributed training strategies (DDP, FSDP, TP).
  • Ready for NVIDIA Blackwell hardware and supports CUDA Graphs.
  • Integrates custom Triton kernels and offers plugins for easy optimization swapping; a plugin sketch follows this list.
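
A sketch of the plugin mechanism, assuming the plugins keyword and the "reduce-overhead" (CUDA Graphs) plugin shown in the project's README; verify both names against your installed version:

    import torch
    import thunder

    model = torch.nn.Linear(2048, 2048).cuda()

    # Assumed per the project's README examples: the "reduce-overhead" plugin
    # wraps execution in CUDA Graphs; other plugins swap in quantization or
    # distributed strategies without changes to the model code.
    thunder_model = thunder.jit(model, plugins="reduce-overhead")

    x = torch.randn(32, 2048, device="cuda")
    y = thunder_model(x)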

Maintenance & Community

  • Developed in collaboration with the community, with significant contributions from NVIDIA.
  • Active development indicated by CI badges.
  • Community support available via Discord.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Requires specific PyTorch and CUDA versions for optimal functionality, with advanced features like Blackwell support needing nightly builds.
  • Performance gains are hardware- and model-dependent; the project itself cautions that specific optimizations "may or may not make a big difference".
Health Check

  • Last commit: 11 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 104
  • Issues (30d): 20
  • Star history: 22 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4% · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago
Updated 3 weeks ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0% · 790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Lewis Tunstall (Research Engineer at Hugging Face), and 15 more.

torchtune by pytorch

0.2% · 5k stars
PyTorch library for LLM post-training and experimentation
Created 1 year ago
Updated 1 day ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

0.1% · 6k stars
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago