composer by mosaicml

DL framework for training at scale, optimized for large-scale clusters

Created 3 years ago
5,410 stars

Top 9.4% on SourcePulse

Project Summary

Composer is an open-source PyTorch library designed to simplify and accelerate deep learning model training at scale. It targets researchers and engineers training large models such as LLMs, diffusion models, and transformers, abstracting away the complexities of distributed training, data loading, and memory optimization to enable faster experimentation and iteration.

How It Works

Composer centers on a highly optimized Trainer abstraction that streamlines the PyTorch training loop. It integrates advanced parallelism techniques such as PyTorch's Fully Sharded Data Parallel (FSDP) and standard Distributed Data Parallel (DDP) for efficient multi-node training. A flexible callback system lets users inject custom logic at well-defined points in the training loop, while built-in speedup algorithms, drawn from recent research, can be composed into "recipes" that significantly boost training throughput.
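The callback idea above can be sketched as a minimal event dispatcher. This is a conceptual illustration only; the names used here (Event, CallbackRunner) are hypothetical and do not reflect Composer's actual API.

```python
from enum import Enum
from typing import Callable

class Event(Enum):
    """Illustrative training-loop events (not Composer's real event set)."""
    FIT_START = "fit_start"
    EPOCH_START = "epoch_start"
    BATCH_END = "batch_end"
    EPOCH_END = "epoch_end"

class CallbackRunner:
    """Dispatches training-loop events to registered callback functions."""
    def __init__(self) -> None:
        self._hooks: dict[Event, list[Callable]] = {e: [] for e in Event}

    def register(self, event: Event, fn: Callable) -> None:
        self._hooks[event].append(fn)

    def run_event(self, event: Event, state: dict) -> None:
        # Invoke every callback registered for this event, passing shared state.
        for fn in self._hooks[event]:
            fn(state)

# Example: record the loss at the end of every batch.
runner = CallbackRunner()
losses = []
runner.register(Event.BATCH_END, lambda state: losses.append(state["loss"]))

for step, loss in enumerate([0.9, 0.7, 0.5]):
    runner.run_event(Event.BATCH_END, {"step": step, "loss": loss})

print(losses)  # prints [0.9, 0.7, 0.5]
```

The key design point is that callbacks observe shared training state at fixed hook points, so custom logging, checkpointing, or speedup logic can be added without modifying the training loop itself.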

Quick Start & Requirements

  • Installation: pip install mosaicml
  • Prerequisites: Python, PyTorch, and CUDA-capable GPUs (recommended).
  • Resources: Docker images are available for simplified environment setup.
  • Links: Website, Getting Started, Docs

Highlighted Details

  • Scalability: Supports training from 1 to 512 GPUs and datasets from 50MB to 10TB.
  • Elastic Checkpointing: Enables resuming training on different hardware configurations.
  • Data Streaming: Integrates with MosaicML StreamingDataset for on-the-fly data loading from cloud storage.
  • Workflow Automation: Features like auto-resumption and CUDA OOM prevention simplify training management.

Maintenance & Community

  • Actively developed by MosaicML, with contributions from the broader ML community.
  • Community support available via Slack.
  • Resources include tutorials for BERT, LLMs, and migrating from PyTorch Lightning.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

  • The library is not recommended for graph neural networks (GNNs), generative adversarial networks (GANs), or reinforcement learning (RL), since its design assumptions can be suboptimal for these domains.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 1
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director of AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 19 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

0.1%
41k
AI system for large-scale parallel training
Created 3 years ago
Updated 13 hours ago