torchtitan by pytorch

PyTorch platform for generative AI model training research

created 1 year ago
4,143 stars

Top 12.1% on sourcepulse

Project Summary

TorchTitan is a PyTorch-native platform for large-scale generative AI model training, targeting researchers and developers seeking a flexible, minimal, and extensible framework. It aims to accelerate innovation by simplifying the implementation of advanced distributed training techniques for models like LLMs and diffusion models.

How It Works

TorchTitan leverages PyTorch's native scaling features, offering composable multi-dimensional parallelisms (Tensor, Pipeline, Context) and advanced techniques like FSDP2 with per-parameter sharding, activation checkpointing, and Float8 support. Its design prioritizes ease of understanding, minimal code modification for parallelism, and a clean, reusable component-based architecture.
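
Below is a minimal sketch of FSDP2-style per-parameter sharding using stock PyTorch APIs (torchtitan drives the same machinery through its own configuration; the model, sizes, and launch command here are illustrative assumptions, not torchtitan's code):

    # Assumes PyTorch >= 2.6 (or a nightly), at least one CUDA GPU, and a
    # torchrun launch, e.g.: torchrun --nproc_per_node=2 fsdp2_sketch.py
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # A stand-in transformer block; torchtitan applies sharding per block.
    model = nn.TransformerEncoderLayer(d_model=512, nhead=8, device="cuda")

    # FSDP2 shards each parameter tensor individually (per-parameter
    # sharding), rather than flattening parameter groups as FSDP1 did.
    fully_shard(model)

    x = torch.randn(8, 16, 512, device="cuda")
    model(x).sum().backward()  # parameters gather/reshard around fwd/bwd
    dist.destroy_process_group()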

Quick Start & Requirements

  • Install dependencies with pip install -r requirements.txt, then install a PyTorch nightly wheel with pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall.
  • Requires PyTorch nightly builds (CUDA 12.6 recommended); AMD GPU support is available via ROCm 6.3. A sanity-check snippet follows this list.
  • A tokenizer download script is provided for Llama 3.1 models.
  • Official documentation and a quick-start guide are available.
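
As a quick way to confirm the nightly wheel took effect (a hypothetical check, not part of torchtitan's documentation):

    import torch

    # Nightly wheels carry a ".dev" version suffix, e.g. "2.x.0.devYYYYMMDD+cu126".
    print(torch.__version__)
    # Should print True once the CUDA 12.6 (or ROCm 6.3) runtime is set up.
    print(torch.cuda.is_available())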

Highlighted Details

  • Supports multi-dimensional composable parallelisms: FSDP2, Tensor Parallel (async TP), Pipeline Parallel, Context Parallel.
  • Features selective/full activation checkpointing, distributed checkpointing, and Float8 support (a minimal activation-checkpointing sketch follows this list).
  • Integrates with torch.compile and offers interoperable checkpoints loadable by torchtune.
  • Includes comprehensive logging (Tensorboard/W&B), debugging tools (profiling), and helper scripts for model conversion and memory estimation.
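
To make the activation-checkpointing bullet concrete, here is a hedged sketch using generic PyTorch APIs (torch.utils.checkpoint composed with torch.compile), not torchtitan's own selective-AC configuration:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block = nn.TransformerEncoderLayer(d_model=256, nhead=4)

    def forward_with_ac(x):
        # Drop this block's activations after forward and recompute them
        # during backward, trading compute for memory ("full" AC per block).
        return checkpoint(block, x, use_reentrant=False)

    compiled = torch.compile(forward_with_ac)  # AC composes with compile
    y = compiled(torch.randn(2, 10, 256))
    y.sum().backward()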

Maintenance & Community

  • Active development with recent updates including Llama 4 support and diffusion model experiments.
  • Presentations at PyTorch Conference 2024 and an upcoming ICLR 2025 poster.
  • Community discussion via the PyTorch forum.

Licensing & Compatibility

  • Licensed under BSD 3-Clause.
  • Users must adhere to separate licenses for third-party data and models.

Limitations & Caveats

The project is in a pre-release state, so breaking changes should be expected. Llama 3.1 training has been showcased at up to 512 GPUs, but support for other models remains experimental.

Health Check

  • Last commit: 18 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 129
  • Issues (30d): 36
  • Star history: 525 stars in the last 90 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), Nathan Lambert (AI Researcher at AI2), and 1 more.

unified-io-2 by allenai

0.3% · 619 stars
Unified-IO 2 code for training, inference, and demo
created 1 year ago · updated 1 year ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine

0.1% · 839 stars
PyTorch-native framework for LLM training
created 1 year ago · updated 3 weeks ago
Starred by Lewis Tunstall (Researcher at Hugging Face), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 5 more.

torchtune by pytorch

0.2% · 5k stars
PyTorch library for LLM post-training and experimentation
created 1 year ago · updated 23 hours ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago