ColossalAI by hpcaitech

AI system for large-scale parallel training

Created 3 years ago
41,161 stars

Top 0.7% on SourcePulse

GitHub: https://github.com/hpcaitech/ColossalAI
Project Summary

Colossal-AI is a unified deep learning system designed to make training and inference of large AI models more efficient, cost-effective, and accessible. It targets researchers and engineers working with massive models, offering a suite of parallelization strategies and memory optimization techniques to simplify distributed training and inference.

How It Works

Colossal-AI provides a comprehensive set of parallelization strategies, including Data Parallelism, Pipeline Parallelism, 1D/2D/2.5D/3D Tensor Parallelism, Sequence Parallelism, and Zero Redundancy Optimizer (ZeRO). It also features heterogeneous memory management (PatrickStar) and an auto-parallelism system. This multi-faceted approach allows users to scale their models across multiple GPUs and nodes with minimal code changes, abstracting away the complexities of distributed computing.
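
As a rough sketch of how these strategies are selected in practice, the example below uses the Booster API with a HybridParallelPlugin. This is a hedged illustration, not a verified recipe: class and argument names follow recent releases and should be checked against the documentation, the GPT-2 model from the transformers package is only a stand-in, and tensor or pipeline degrees above 1 generally require architectures with Shardformer support.

    # Hedged sketch (assumes a recent Colossal-AI release and the `transformers`
    # package): combine tensor parallelism and ZeRO through one plugin config.
    import torch
    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin
    from transformers import GPT2Config, GPT2LMHeadModel

    colossalai.launch_from_torch()  # initialize distributed state under torchrun

    # tp_size sets the tensor-parallel degree, pp_size the pipeline depth, and
    # zero_stage shards optimizer states over the remaining data-parallel ranks.
    plugin = HybridParallelPlugin(tp_size=2, pp_size=1, zero_stage=1)
    booster = Booster(plugin=plugin)

    model = GPT2LMHeadModel(GPT2Config())  # small placeholder config, illustration only
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # boost() returns wrapped objects that handle sharding and communication;
    # the usual forward/backward/step loop is then written against them.
    model, optimizer, *_ = booster.boost(model, optimizer)

Such a script is started with a multi-process launcher, e.g. torchrun or the bundled colossalai run CLI, with one process per GPU.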

Quick Start & Requirements

  • Install: pip install colossalai (Linux only). To also build PyTorch extensions during install: BUILD_EXT=1 pip install colossalai. Nightly builds: pip install colossalai-nightly. A quick post-install check is sketched after this list.
  • Requirements: PyTorch >= 2.2, Python >= 3.7, CUDA >= 11.0, NVIDIA GPU Compute Capability >= 7.0. Linux OS.
  • From Source: git clone the repository, cd ColossalAI, pip install . (with BUILD_EXT=1 for CUDA kernels).
  • Docker: docker build -t colossalai ./docker and run with docker run -ti --gpus all --rm --ipc=host colossalai bash.
  • Documentation: https://colossalai.readthedocs.io/en/latest/
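
After installing, a quick sanity check along these lines confirms that the package and a CUDA-enabled PyTorch are importable (a minimal sketch; the version attribute is assumed from recent releases):

    # Post-install sanity check; runs with plain `python`, no GPU needed to import.
    import torch
    import colossalai

    print("colossalai:", colossalai.__version__)
    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())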

Highlighted Details

  • Offers solutions for training and inference of large models like LLaMA, GPT, and Stable Diffusion.
  • Reports significant speedups and memory reductions, e.g., 5.6x lower memory consumption for Stable Diffusion training and 2x faster inference with Colossal-Inference.
  • Features Open-Sora for Sora-like video generation and ColossalChat for RLHF pipeline implementation.
  • Supports advanced parallelism techniques including 3D Tensor Parallelism and ZeRO.
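
As a hedged illustration of the ZeRO and memory-optimization features listed above, the sketch below uses the GeminiPlugin, which shards model and optimizer states and can offload them between GPU and CPU memory. Argument names follow recent releases and should be verified against the documentation; the toy model is a placeholder, and HybridAdam is the fused optimizer commonly paired with Gemini in the project's examples.

    # Sketch only: ZeRO-style sharding plus GPU/CPU memory offload via Gemini.
    import torch
    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import GeminiPlugin
    from colossalai.nn.optimizer import HybridAdam

    colossalai.launch_from_torch()

    # placement_policy="auto" lets Gemini move parameters and optimizer states
    # between GPU and CPU memory as utilization changes.
    plugin = GeminiPlugin(placement_policy="auto", precision="bf16")
    booster = Booster(plugin=plugin)

    model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU(),
                                torch.nn.Linear(4096, 1024))  # toy placeholder model
    optimizer = HybridAdam(model.parameters(), lr=1e-4)
    model, optimizer, *_ = booster.boost(model, optimizer)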

Maintenance & Community

  • Active development with regular updates and releases.
  • Community channels include Forum and Slack.
  • Welcomes contributions from developers and partners.
  • Cite Us: BibTeX citation provided.

Licensing & Compatibility

  • The project appears to use a permissive license, but the README does not state it explicitly; check the repository's LICENSE file and verify compatibility for commercial use.

Limitations & Caveats

  • Installation is currently Linux-only.
  • Building CUDA extensions requires specific setup steps.
  • Users with older CUDA versions (e.g., 10.2) may need to manually download and integrate the cub library.
Health Check

  • Last Commit: 13 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 3
  • Star History: 100 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm_training_handbook by huggingface

0% · 511 stars
Handbook for large language model training methodologies
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0% · 3k stars
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago · Updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.2% · 7k stars
Framework for training large-scale autoregressive language models
Created 4 years ago · Updated 2 days ago