Training toolkit for large AI models
BMTrain is an open-source toolkit for efficient large-scale model training, covering both pre-training and fine-tuning. It targets researchers and engineers working with models of tens of billions of parameters, making distributed training feel like stand-alone development.
How It Works
BMTrain integrates with PyTorch, enabling distributed training through its `init_distributed` function, which replaces PyTorch's native distributed initialization. It implements ZeRO optimization by having users replace `torch.nn.Module` with `bmtrain.DistributedModule` and `torch.nn.Parameter` with `bmtrain.DistributedParameter`, as in the sketch below.
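A minimal sketch of that module rewrite, following the replacement rules above; the layer shape here is illustrative:

```python
import torch
import bmtrain as bmt

bmt.init_distributed(seed=0)  # stands in for PyTorch's native distributed init

class Linear(bmt.DistributedModule):  # instead of torch.nn.Module
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        # instead of torch.nn.Parameter; BMTrain shards these across ranks (ZeRO)
        self.weight = bmt.DistributedParameter(torch.empty(dim_out, dim_in))
        self.bias = bmt.DistributedParameter(torch.zeros(dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, self.weight, self.bias)
```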
Transformer blocks can be further optimized by wrapping them in `bmtrain.Block` with a specified ZeRO level, and communication overhead for sequential blocks is reduced by grouping them in a `bmtrain.TransformerBlockList`, as sketched below.
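A sketch of that wrapping; `SomeTransformerLayer` and `num_layers` are placeholders for a user-defined layer and depth, and the `zero_level` keyword is an assumption matching the "specified ZeRO level" above:

```python
import bmtrain as bmt

# SomeTransformerLayer and num_layers are hypothetical placeholders.
layers = bmt.TransformerBlockList([
    bmt.Block(SomeTransformerLayer(), zero_level=3)  # zero_level: assumed keyword
    for _ in range(num_layers)
])
hidden = layers(hidden)  # runs the blocks sequentially with reduced communication overhead
```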
Quick Start & Requirements
- Install with `pip install bmtrain` (compiles C/CUDA source code, which may take 5-10 minutes).
- Call `bmt.init_distributed()`, replace PyTorch modules with their BMTrain equivalents, and launch with `torch.distributed.launch` or `torchrun`, as sketched below.
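A skeleton of that quick-start flow; model construction and the training loop are placeholders:

```python
# train.py -- hypothetical quick-start skeleton
import bmtrain as bmt

bmt.init_distributed(seed=0)  # must be called before building the model

model = build_model()   # placeholder: returns a bmt.DistributedModule subclass
run_training(model)     # placeholder training loop (see OptimManager below)

# Launch on a single node with 4 GPUs, for example:
#   torchrun --nproc_per_node=4 train.py
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
```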
Highlighted Details
- Provides `bmtrain.optim.AdamOffloadOptimizer` and `bmtrain.lr_scheduler` for optimized training.
- Offers `OptimManager` to handle optimizer zero-grad, backward, clipping, and step operations; see the sketch below.
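A hedged sketch of the training step these utilities imply; the `Noam` scheduler, the loss-scale value, and `model`/`data_loader` are assumptions or placeholders:

```python
import bmtrain as bmt

optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters())  # offloads optimizer state to CPU
lr_scheduler = bmt.lr_scheduler.Noam(  # assumed scheduler choice
    optimizer, start_lr=1e-3, warmup_iter=100, end_iter=-1
)

optim_manager = bmt.optim.OptimManager(loss_scale=1024)  # loss scale is illustrative
optim_manager.add_optimizer(optimizer, lr_scheduler)

for batch in data_loader:  # placeholder data source
    optim_manager.zero_grad()
    loss = model(batch)
    optim_manager.backward(loss)  # scales the loss and runs backward
    optim_manager.clip_grad_norm(optimizer.param_groups, max_norm=1.0)
    optim_manager.step()  # steps the optimizer and lr_scheduler together
```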
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
BMTrain makes deep modifications to PyTorch's internals, potentially leading to unexpected behavior. Users are advised to submit issues for any observed problems.