Megatron-Bridge by NVIDIA-NeMo

Scalable LLM training and conversion between Hugging Face and Megatron Core

Created 7 months ago
348 stars

Top 79.9% on SourcePulse

View on GitHub
Project Summary

This library addresses the need for seamless interoperability between the Hugging Face ecosystem and NVIDIA's Megatron Core, along with efficient training of large language and vision-language models. It targets researchers and engineers who need advanced distributed training, offering a PyTorch-native solution for bidirectional model conversion, pretraining, and fine-tuning. The primary benefit is that users can apply Megatron Core's parallelism and optimized training infrastructure to familiar Hugging Face models, accelerating LLM/VLM development and deployment.

How It Works

Megatron-Bridge acts as a conversion and verification layer that performs bidirectional checkpoint conversion between Hugging Face and Megatron Core formats. It also provides a refactored, PyTorch-native training loop that leverages Megatron Core for advanced parallelism (tensor, pipeline) and mixed-precision training (FP8, BF16, FP4). Models can come from existing Hugging Face checkpoints or custom PyTorch definitions, with optimized paths through Transformer Engine for high throughput and scalability.
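
As a rough illustration of the round-trip workflow, here is a minimal Python sketch. It assumes an AutoBridge-style entry point along the lines of the project's README; the import path, method names, and the checkpoint used are illustrative and should be verified against the current documentation.

    # Hypothetical round-trip: Hugging Face -> Megatron Core -> Hugging Face.
    # AutoBridge and its methods follow the project's README as of this summary;
    # confirm names against the installed version before relying on them.
    from megatron.bridge import AutoBridge

    # Import a Hugging Face checkpoint and build a Megatron Core model provider.
    bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
    provider = bridge.to_megatron_provider()

    model = ...  # build the Megatron model from the provider and train it (elided)

    # Export the trained weights back to Hugging Face format.
    bridge.save_hf_pretrained(model, "./llama-3.2-1b-hf-export")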

Quick Start & Requirements

The recommended installation is via the NeMo Framework container (nvcr.io/nvidia/nemo:${TAG}). A Python 3.10+ environment is required. Users must log in to the Hugging Face Hub (huggingface-cli login). Training scripts are typically launched with torchrun.
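
For scripted setups inside the container, the Hub login can also be done programmatically through the huggingface_hub package; the token below is a placeholder.

    # Programmatic alternative to `huggingface-cli login`.
    from huggingface_hub import login

    login(token="hf_...")  # placeholder; use your own Hub access token

A typical single-node multi-GPU launch then looks like torchrun --nproc_per_node=8 <training_script.py>, where the script and its arguments depend on the chosen recipe.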

Highlighted Details

  • Bidirectional Conversion: Seamlessly converts checkpoints between Hugging Face and Megatron formats, supporting online import/export with memory-efficient streaming.
  • Advanced Parallelism: Integrates Megatron Core's parallelism (TP/PP/VPP/CP/EP/ETP) and supports mixed-precision training (FP8, BF16, FP4).
  • Flexible Training: Offers a customizable PyTorch-native training loop for fine-grained control over data loading, distributed training, and evaluation.
  • PEFT & SFT: Implements Supervised Fine-Tuning and Parameter-Efficient Fine-Tuning methods such as LoRA and DoRA (see the LoRA sketch after this list).
  • SOTA Recipes: Provides production-ready training recipes for popular LLMs (e.g., Llama 3, Qwen2.5) with optimized configurations.
  • Performance: Engineered for high utilization and near-linear scalability across thousands of nodes.
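
To make the PEFT bullet above concrete, the following minimal sketch shows the core LoRA computation (a frozen base weight plus a trainable low-rank update) in plain PyTorch. It illustrates the technique itself, not Megatron-Bridge's internal implementation; the class and parameter names are illustrative.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Minimal LoRA layer: y = base(x) + (alpha/r) * B(A(x)).

        The pretrained weight is frozen; only the low-rank factors A and B
        are trained. Illustrative only, not Megatron-Bridge's implementation.
        """

        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the pretrained projection
            self.lora_a = nn.Linear(base.in_features, r, bias=False)
            self.lora_b = nn.Linear(r, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)  # update starts as a no-op
            self.scaling = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

    # Usage: wrap an existing projection and train only the LoRA factors.
    layer = LoRALinear(nn.Linear(1024, 1024), r=8, alpha=16)
    out = layer(torch.randn(2, 1024))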

Maintenance & Community

The project is a continuation of MBridge and has been adopted by several downstream projects, including veRL, slime, SkyRL, and NeMo-RL. Community contributions are acknowledged.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README, which may impact commercial use or integration decisions.

Limitations & Caveats

The "Supported Models" table indicates that full training/fine-tuning recipes or checkpoint conversion are "Coming soon" for several models, suggesting incomplete support for certain architectures. Installation primarily relies on a Docker container.

Health Check

Last Commit: 13 hours ago
Responsiveness: Inactive
Pull Requests (30d): 162
Issues (30d): 101
Star History: 87 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack
Top 0.7% on SourcePulse · 278 stars
Efficiently train foundation models with PyTorch
Created 1 year ago · Updated 1 month ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

InternEvo by InternLM
Top 0.2% on SourcePulse · 417 stars
Lightweight training framework for model pre-training
Created 2 years ago · Updated 4 months ago

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jiayi Pan (Author of SWE-Gym; MTS at xAI).

Pai-Megatron-Patch by alibaba
Top 0.7% on SourcePulse · 2k stars
Training toolkit for LLMs & VLMs using Megatron
Created 2 years ago · Updated 3 weeks ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 8 more.

h2o-llmstudio by h2oai
Top 0.1% on SourcePulse · 5k stars
LLM Studio: framework for LLM fine-tuning via GUI or CLI
Created 2 years ago · Updated 3 weeks ago