composer by mosaicml

DL framework for training at scale, optimized for large-scale clusters

Created 3 years ago
5,410 stars

Top 9.4% on SourcePulse

Project Summary

Composer is an open-source PyTorch library designed to simplify and accelerate deep learning model training at scale. It targets researchers and engineers training large models such as LLMs, diffusion models, and transformers, abstracting away the complexities of distributed training, data loading, and memory optimization to enable faster experimentation and iteration.

How It Works

Composer centers on a highly optimized Trainer abstraction that streamlines the PyTorch training loop. It integrates advanced parallelism techniques such as PyTorch's Fully Sharded Data Parallel (FSDP) and standard Distributed Data Parallel (DDP) for efficient multi-node training. A flexible callback system lets users inject custom logic at well-defined points in the training loop, while built-in speedup algorithms, drawn from recent research, can be composed into "recipes" that significantly boost training throughput.
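The callback idea above can be sketched as a minimal event dispatcher. This is a conceptual illustration only; the names used here (Event, CallbackRunner) are hypothetical and do not reflect Composer's actual API.

```python
from enum import Enum
from typing import Callable

class Event(Enum):
    """Illustrative training-loop events (not Composer's real event set)."""
    FIT_START = "fit_start"
    EPOCH_START = "epoch_start"
    BATCH_END = "batch_end"
    EPOCH_END = "epoch_end"

class CallbackRunner:
    """Dispatches training-loop events to registered callback functions."""
    def __init__(self) -> None:
        self._hooks: dict[Event, list[Callable]] = {e: [] for e in Event}

    def register(self, event: Event, fn: Callable) -> None:
        self._hooks[event].append(fn)

    def run_event(self, event: Event, state: dict) -> None:
        # Invoke every callback registered for this event, passing shared state.
        for fn in self._hooks[event]:
            fn(state)

# Example: record the loss at the end of every batch.
runner = CallbackRunner()
losses = []
runner.register(Event.BATCH_END, lambda state: losses.append(state["loss"]))

for step, loss in enumerate([0.9, 0.7, 0.5]):
    runner.run_event(Event.BATCH_END, {"step": step, "loss": loss})

print(losses)  # prints [0.9, 0.7, 0.5]
```

The key design point is that callbacks observe shared training state at fixed hook points, so custom logging, checkpointing, or speedup logic can be added without modifying the training loop itself.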

Quick Start & Requirements

  • Installation: pip install mosaicml
  • Prerequisites: Python, PyTorch, and CUDA-capable GPUs (recommended).
  • Resources: Docker images are available for simplified environment setup.
  • Links: Website, Getting Started, Docs

Highlighted Details

  • Scalability: Supports training from 1 to 512 GPUs and datasets from 50MB to 10TB.
  • Elastic Checkpointing: Enables resuming training on different hardware configurations.
  • Data Streaming: Integrates with MosaicML StreamingDataset for on-the-fly data loading from cloud storage.
  • Workflow Automation: Features like auto-resumption and CUDA OOM prevention simplify training management.

Maintenance & Community

  • Actively developed by MosaicML, with contributions from the broader ML community.
  • Community support available via Slack.
  • Resources include tutorials for BERT, LLMs, and migrating from PyTorch Lightning.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

  • The library is not recommended for graph neural networks (GNNs), generative adversarial networks (GANs), or reinforcement learning (RL), since its design assumptions can be suboptimal for these domains.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 1
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director of AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 19 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

0.1%
41k
AI system for large-scale parallel training
Created 3 years ago
Updated 13 hours ago