Megatron-LM by NVIDIA

Framework for training transformer models at scale

created 6 years ago
13,024 stars

Top 3.9% on sourcepulse

View on GitHub
Project Summary

Megatron-LM is a research-oriented framework and Megatron-Core is a library of GPU-optimized techniques for training transformer models at scale. Together they target researchers and developers working with large language models, offering advanced parallelism and memory-saving features for efficient training on NVIDIA hardware.

How It Works

Megatron-Core offers composable, modular APIs for GPU-optimized building blocks such as attention mechanisms, transformer layers, and normalization. It supports advanced model parallelism (tensor, sequence, pipeline, context, and MoE expert parallelism) alongside data parallelism, enabling efficient training of models with hundreds of billions of parameters. Techniques such as activation recomputation, distributed optimizers, and FlashAttention further reduce memory usage and improve training speed.
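
The sketch below illustrates the composable API, loosely following the Megatron-Core quickstart pattern: initialize the model-parallel process groups, then assemble a small GPT model from a TransformerConfig and a layer spec. The sizes and single-GPU parallel settings here are placeholders, and exact class names and signatures may differ between releases, so treat it as a sketch rather than copy-paste code.

```python
# Minimal sketch of Megatron-Core's composable API (quickstart-style);
# exact signatures may differ across Megatron-Core releases.
import os

import torch
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Set up torch.distributed and Megatron's model-parallel groups.
# Tensor/pipeline sizes of 1 keep this runnable on a single GPU;
# larger values shard the model across more ranks.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

# One config object drives the attention, MLP, and normalization blocks.
config = TransformerConfig(
    num_layers=2,
    hidden_size=256,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

# Assemble a toy GPT model from the config and a transformer-layer spec.
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=128,
).cuda()
```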

Quick Start & Requirements

  • Installation: recommended via NVIDIA's NGC PyTorch container; Docker commands for setup are provided.
  • Prerequisites: recent releases of PyTorch, CUDA, NCCL, and NVIDIA APEX; NLTK is needed for data preprocessing.
  • Resources: Requires NVIDIA GPUs (Hopper architecture support for FP8). Training examples scale up to 6144 H100 GPUs.
  • Documentation: Megatron-Core Developer Guide

Highlighted Details

  • Supports advanced parallelism: tensor, sequence, pipeline, context, and MoE expert parallelism.
  • Features memory-optimization techniques: activation checkpointing, distributed optimizer, and FlashAttention (see the sketch after this list).
  • Enables efficient training of models with hundreds of billions of parameters, demonstrating strong scaling on H100 GPUs.
  • Offers tools for checkpoint conversion between different model classes and formats.
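
To make the memory trade-off behind activation checkpointing concrete, here is a generic sketch in plain PyTorch: activations inside the wrapped block are dropped during the forward pass and recomputed during backward. It illustrates the technique itself, not Megatron-LM's internal implementation, which is enabled through its own training flags.

```python
# Generic activation-checkpointing sketch in plain PyTorch; it shows the
# idea only and is not Megatron-LM's recomputation code path.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(hidden, 4 * hidden),
            nn.GELU(),
            nn.Linear(4 * hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Intermediate activations of `self.block` are not kept; they are
        # recomputed in backward, trading extra compute for less memory.
        return checkpoint(self.block, x, use_reentrant=False)


x = torch.randn(8, 1024, requires_grad=True)
CheckpointedMLP()(x).sum().backward()
```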

Maintenance & Community

  • Actively developed by NVIDIA, with recent updates including Mamba support and multimodal training enhancements.
  • Links to documentation and examples are provided.

Licensing & Compatibility

  • License: BSD-3-Clause (see the repository LICENSE file for the full terms).
  • Compatible with NVIDIA accelerated computing infrastructure and Tensor Core GPUs.

Limitations & Caveats

FlashAttention is non-deterministic, so avoid --use-flash-attn when bitwise reproducibility is required. Transformer Engine needs NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 for deterministic execution. Determinism has been verified in NGC PyTorch containers >= 23.12.
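
A minimal sketch of wiring those determinism knobs together is shown below. The NVTE_ALLOW_NONDETERMINISTIC_ALGO variable comes from the caveat above; the CUBLAS_WORKSPACE_CONFIG setting and the torch.use_deterministic_algorithms call are generic PyTorch reproducibility steps added here as assumptions, not Megatron-specific requirements.

```python
# Sketch of a deterministic setup, assuming an NGC PyTorch container >= 23.12.
# Environment variables must be set before the CUDA libraries initialize.
import os

os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"  # Transformer Engine determinism
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"     # deterministic cuBLAS GEMMs (generic PyTorch step)

import torch

torch.use_deterministic_algorithms(True)  # raise on non-deterministic PyTorch ops
torch.backends.cudnn.benchmark = False    # disable non-deterministic cuDNN autotuning

# Launch training without --use-flash-attn when bitwise reproducibility is required.
```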

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 97
  • Issues (30d): 185

Star History

839 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

0.9% · 4k stars
PyTorch platform for generative AI model training research
created 1 year ago
updated 18 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 14 hours ago