Framework for training transformer models at scale
Top 3.9% on sourcepulse
Megatron-LM and Megatron-Core provide a research framework and a library of GPU-optimized techniques, respectively, for training transformer models at scale. They are designed for researchers and developers working with large language models, offering advanced parallelism and memory-saving features for efficient training on NVIDIA hardware.
How It Works
Megatron-Core offers composable, modular APIs for GPU-optimized building blocks like attention mechanisms, transformer layers, and normalization. It supports advanced model parallelism (tensor, sequence, pipeline, context, MoE) and data parallelism, enabling efficient training of models with hundreds of billions of parameters. Techniques like activation recomputation, distributed optimizers, and FlashAttention further reduce memory usage and improve training speed.
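As a rough illustration, the sketch below assembles a small GPT model from Megatron-Core's building blocks in the style of the project's quickstart. It is a minimal sketch, not the definitive API: exact module paths, configuration fields, and argument names may differ between releases and should be treated as assumptions.

```python
# Minimal sketch (assumptions noted in comments): set up model parallelism and
# build a tiny GPT model from Megatron-Core's composable building blocks.
import os

import torch
from megatron.core import parallel_state
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec


def initialize_model_parallel(tp_size: int = 2, pp_size: int = 1) -> None:
    # Assumes one process per GPU launched with torchrun, so LOCAL_RANK is set.
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    torch.distributed.init_process_group(backend="nccl")
    # Carve the world into tensor- and pipeline-parallel groups.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp_size,
        pipeline_model_parallel_size=pp_size,
    )


def build_tiny_gpt() -> GPTModel:
    # TransformerConfig bundles the hyperparameters consumed by the
    # GPU-optimized blocks (attention, MLP, normalization).
    config = TransformerConfig(
        num_layers=2,
        hidden_size=128,
        num_attention_heads=4,
        use_cpu_initialization=True,
        pipeline_dtype=torch.float32,
    )
    # GPTModel composes those blocks into a decoder-only model whose layers
    # are sharded according to the parallel state initialized above.
    return GPTModel(
        config=config,
        transformer_layer_spec=get_gpt_layer_local_spec(),
        vocab_size=32000,
        max_sequence_length=1024,
    )
```

Training on top of such a model would then use standard data-parallel wrappers or Megatron-Core's pipeline schedules; the sizes above are deliberately tiny and only meant to show how the pieces compose.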
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
FlashAttention is non-deterministic; use --use-flash-attn with caution if bitwise reproducibility is critical. Transformer Engine requires NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 for determinism. Deterministic behavior has been verified in NGC PyTorch containers >= 23.12.
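A hedged sketch of how these determinism settings might be applied from a Python launcher; only the --use-flash-attn flag and the NVTE_ALLOW_NONDETERMINISTIC_ALGO variable come from the text above, everything else is illustrative.

```python
# Illustrative only: apply the determinism-related settings described above.
import os

# Transformer Engine reads this environment variable; it must be set before
# any Transformer Engine kernels are selected in the training process.
os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"

# For bitwise reproducibility, leave --use-flash-attn out of the training
# command, since FlashAttention kernels are non-deterministic.
train_args = [
    "--tensor-model-parallel-size", "2",
    # "--use-flash-attn",  # intentionally omitted when reproducibility matters
]
```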