BMTrain by OpenBMB

Training toolkit for large AI models

Created 3 years ago
609 stars

Top 53.9% on SourcePulse

Project Summary

BMTrain is an open-source toolkit for efficient training of large models, covering both pre-training and fine-tuning. It targets researchers and engineers working with models of tens of billions of parameters, and aims to keep distributed training code as simple as stand-alone (single-machine) training code.

How It Works

BMTrain builds on PyTorch. Distributed training is set up by calling its init_distributed function at the start of the script, in place of the initialization call from PyTorch's native distributed module. ZeRO optimization is enabled by replacing torch.nn.Module with bmtrain.DistributedModule and torch.nn.Parameter with bmtrain.DistributedParameter. Transformer blocks can be optimized further by wrapping each one in bmtrain.Block with a specified ZeRO level, and placing sequential blocks in a bmtrain.TransformerBlockList reduces communication overhead.
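The module replacements described above might look roughly like the following sketch. It uses the names given in this summary (bmtrain.DistributedModule, bmtrain.DistributedParameter, bmtrain.Block, bmtrain.TransformerBlockList). The zero_level keyword, the FeedForward and TinyModel classes, and the layer sizes are illustrative assumptions and should be checked against the installed BMTrain version. Imports are deferred into the function so the file can be imported on a machine without a GPU.

```python
def build_model(num_layers: int = 4, dim: int = 1024, zero_level: int = 3):
    """Build a toy model from BMTrain's drop-in module types (sketch only).

    Assumes bmt.init_distributed() has already been called in this process.
    """
    # Deferred imports: both packages need a CUDA-enabled installation.
    import torch
    import bmtrain as bmt

    class FeedForward(bmt.DistributedModule):  # used instead of torch.nn.Module
        def __init__(self):
            super().__init__()
            # Used instead of torch.nn.Parameter. The values are left
            # uninitialized here; a real model would initialize them.
            self.weight = bmt.DistributedParameter(torch.empty(dim, dim))

        def forward(self, hidden):
            # Hypothetical layer body: a residual linear map with ReLU.
            return hidden + torch.relu(hidden @ self.weight)

    class TinyModel(bmt.DistributedModule):
        def __init__(self):
            super().__init__()
            # Each block is wrapped in bmt.Block with a ZeRO level (assumed
            # keyword: zero_level); the list wrapper runs them in sequence.
            self.layers = bmt.TransformerBlockList(
                [bmt.Block(FeedForward(), zero_level=zero_level)
                 for _ in range(num_layers)]
            )

        def forward(self, hidden):
            return self.layers(hidden)

    return TinyModel()
```

Compared with a plain PyTorch model, only the base class, the parameter type, and the block wrappers change; the forward logic stays the same, which is the "stand-alone feel" the project describes.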

Quick Start & Requirements

  • Installation: pip install bmtrain (compiles C/CUDA source code, may take 5-10 minutes).
  • Prerequisites: PyTorch, CUDA (implied by compilation).
  • Usage: Initialize with bmt.init_distributed(), replace PyTorch modules with BMTrain equivalents, and launch using torch.distributed.launch or torchrun.
  • Documentation: https://www.openbmb.org/ (linked via website).
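As a rough illustration of the launch step, the helper below assembles a single-node torchrun command line. The script name train.py is hypothetical and stands for a training script that calls bmt.init_distributed() at its start; the helper itself is not part of BMTrain. The --standalone and --nproc_per_node flags are standard torchrun options.

```python
import shlex
from typing import List, Optional


def torchrun_command(
    script: str,
    gpus_per_node: int,
    script_args: Optional[List[str]] = None,
) -> List[str]:
    """Return the argv for a single-node torchrun launch of `script`."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return [
        "torchrun",
        "--standalone",                       # single node, built-in rendezvous
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU
        script,
        *(script_args or []),                 # forwarded to the training script
    ]


if __name__ == "__main__":
    # Print the command as a copy-pasteable shell line.
    print(shlex.join(torchrun_command("train.py", gpus_per_node=8)))
```

Multi-node runs need additional rendezvous options (for example --nnodes and --rdzv_endpoint), which this sketch leaves out.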

Highlighted Details

  • Supports ZeRO-2 and ZeRO-3 optimizations.
  • Offers the bmtrain.optim.AdamOffloadOptimizer optimizer and learning-rate schedulers under bmtrain.lr_scheduler.
  • Provides an OptimManager to handle optimizer zero-grad, backward, clipping, and step operations.
  • Claims significant throughput improvements over standard ZeRO implementations in benchmarks.
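The optimizer-related items above could be combined into a training step along the following lines. This is a sketch under assumptions: the call signatures (the loss_scale argument, add_optimizer, clip_grad_norm with optimizer.param_groups, and the Noam scheduler with its arguments) are recalled from the project's README and may differ between BMTrain versions, and make_train_step, loss_fn, and the hyperparameter values are hypothetical.

```python
def make_train_step(model, loss_scale: float = 1024, max_grad_norm: float = 1.0):
    """Return a train_step(batch, loss_fn) closure driven by an OptimManager."""
    # Deferred import: bmtrain needs a CUDA-enabled installation.
    import bmtrain as bmt

    # Adam variant named in the summary; weight_decay is an example value.
    optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters(), weight_decay=1e-2)
    # Assumed scheduler class and arguments; other schedulers may fit better.
    lr_scheduler = bmt.lr_scheduler.Noam(
        optimizer, start_lr=1e-3, warmup_iter=40, end_iter=1000, num_iter=0
    )
    optim_manager = bmt.optim.OptimManager(loss_scale=loss_scale)
    optim_manager.add_optimizer(optimizer, lr_scheduler)

    def train_step(batch, loss_fn):
        optim_manager.zero_grad()                 # clear gradients
        loss = loss_fn(model(batch))              # forward pass and loss
        optim_manager.backward(loss)              # loss scaling and backward
        grad_norm = optim_manager.clip_grad_norm( # gradient clipping
            optimizer.param_groups, max_norm=max_grad_norm
        )
        optim_manager.step()                      # optimizer update
        return loss, grad_norm

    return train_step
```

Routing zero-grad, backward, clipping, and step through one manager object keeps loss scaling consistent across those four calls, which is the role the summary attributes to OptimManager.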

Licensing & Compatibility

  • Apache 2.0 License.
  • Permits commercial use and linking with closed-source projects.

Limitations & Caveats

BMTrain makes deep modifications to PyTorch's internals, potentially leading to unexpected behavior. Users are advised to submit issues for any observed problems.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

5 stars in the last 30 days
