Megatron-LM by NVIDIA

Framework for training transformer models at scale

created 6 years ago
13,024 stars

Top 3.9% on sourcepulse

View on GitHub
Project Summary

Megatron-LM is a research-oriented framework and Megatron-Core is a library of GPU-optimized techniques for training transformer models at scale. Together they target researchers and developers working with large language models, offering advanced parallelism and memory-saving features for efficient training on NVIDIA hardware.

How It Works

Megatron-Core offers composable, modular APIs for GPU-optimized building blocks such as attention mechanisms, transformer layers, and normalization. It supports advanced model parallelism (tensor, sequence, pipeline, context, and MoE expert parallelism) alongside data parallelism, enabling efficient training of models with hundreds of billions of parameters. Techniques such as activation recomputation, distributed optimizers, and FlashAttention further reduce memory usage and improve training speed.
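
The sketch below illustrates the composable API, loosely following the Megatron-Core quickstart pattern: initialize the model-parallel process groups, then assemble a small GPT model from a TransformerConfig and a layer spec. The sizes and single-GPU parallel settings here are placeholders, and exact class names and signatures may differ between releases, so treat it as a sketch rather than copy-paste code.

```python
# Minimal sketch of Megatron-Core's composable API (quickstart-style);
# exact signatures may differ across Megatron-Core releases.
import os

import torch
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Set up torch.distributed and Megatron's model-parallel groups.
# Tensor/pipeline sizes of 1 keep this runnable on a single GPU;
# larger values shard the model across more ranks.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

# One config object drives the attention, MLP, and normalization blocks.
config = TransformerConfig(
    num_layers=2,
    hidden_size=256,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

# Assemble a toy GPT model from the config and a transformer-layer spec.
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=128,
).cuda()
```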

Quick Start & Requirements

  • Installation: recommended via NVIDIA's NGC PyTorch container; Docker commands for setup are provided.
  • Prerequisites: recent releases of PyTorch, CUDA, NCCL, and NVIDIA APEX; NLTK is needed for data preprocessing.
  • Resources: Requires NVIDIA GPUs (Hopper architecture support for FP8). Training examples scale up to 6144 H100 GPUs.
  • Documentation: Megatron-Core Developer Guide

Highlighted Details

  • Supports advanced parallelism: tensor, sequence, pipeline, context, and MoE expert parallelism.
  • Features memory-optimization techniques: activation checkpointing, distributed optimizer, and FlashAttention (see the sketch after this list).
  • Enables efficient training of models with hundreds of billions of parameters, demonstrating strong scaling on H100 GPUs.
  • Offers tools for checkpoint conversion between different model classes and formats.
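
To make the memory trade-off behind activation checkpointing concrete, here is a generic sketch in plain PyTorch: activations inside the wrapped block are dropped during the forward pass and recomputed during backward. It illustrates the technique itself, not Megatron-LM's internal implementation, which is enabled through its own training flags.

```python
# Generic activation-checkpointing sketch in plain PyTorch; it shows the
# idea only and is not Megatron-LM's recomputation code path.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(hidden, 4 * hidden),
            nn.GELU(),
            nn.Linear(4 * hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Intermediate activations of `self.block` are not kept; they are
        # recomputed in backward, trading extra compute for less memory.
        return checkpoint(self.block, x, use_reentrant=False)


x = torch.randn(8, 1024, requires_grad=True)
CheckpointedMLP()(x).sum().backward()
```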

Maintenance & Community

  • Actively developed by NVIDIA, with recent updates including Mamba support and multimodal training enhancements.
  • Links to documentation and examples are provided.

Licensing & Compatibility

  • License: BSD-3-Clause (see the repository LICENSE file for the full terms).
  • Compatible with NVIDIA accelerated computing infrastructure and Tensor Core GPUs.

Limitations & Caveats

FlashAttention is non-deterministic, so avoid --use-flash-attn when bitwise reproducibility is required. Transformer Engine needs NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 for deterministic execution. Determinism has been verified in NGC PyTorch containers >= 23.12.
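
A minimal sketch of wiring those determinism knobs together is shown below. The NVTE_ALLOW_NONDETERMINISTIC_ALGO variable comes from the caveat above; the CUBLAS_WORKSPACE_CONFIG setting and the torch.use_deterministic_algorithms call are generic PyTorch reproducibility steps added here as assumptions, not Megatron-specific requirements.

```python
# Sketch of a deterministic setup, assuming an NGC PyTorch container >= 23.12.
# Environment variables must be set before the CUDA libraries initialize.
import os

os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"  # Transformer Engine determinism
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"     # deterministic cuBLAS GEMMs (generic PyTorch step)

import torch

torch.use_deterministic_algorithms(True)  # raise on non-deterministic PyTorch ops
torch.backends.cudnn.benchmark = False    # disable non-deterministic cuDNN autotuning

# Launch training without --use-flash-attn when bitwise reproducibility is required.
```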

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 97
  • Issues (30d): 185

Star History

839 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

0.9% · 4k stars
PyTorch platform for generative AI model training research
created 1 year ago
updated 18 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 14 hours ago