NVIDIA-NeMo/Megatron-Bridge: Scalable LLM training and conversion between Hugging Face and Megatron Core
Top 79.9% on SourcePulse
Megatron-Bridge addresses interoperability and efficient training of large language and vision-language models across the Hugging Face ecosystem and NVIDIA's Megatron Core. It targets researchers and engineers who need advanced distributed training, offering a PyTorch-native solution for bidirectional model conversion, pretraining, and fine-tuning. The primary benefit is that users can apply Megatron Core's parallelism and optimized training infrastructure to familiar Hugging Face models, accelerating LLM/VLM development and deployment.
How It Works
Megatron-Bridge acts as a conversion and verification layer, facilitating bidirectional checkpoint conversion between Hugging Face and Megatron Core formats. It integrates a refactored, PyTorch-native training loop that leverages Megatron Core for advanced parallelism (tensor, pipeline) and mixed-precision training (FP8, BF16, FP4). The library supports using existing Hugging Face models or custom PyTorch definitions, with optimized paths for Transformer Engine, ensuring high throughput and scalability.
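To make the round-trip workflow concrete, here is a minimal Python sketch of converting a Hugging Face checkpoint to Megatron Core and back. The `AutoBridge` class and method names shown (`from_hf_pretrained`, `to_megatron_provider`, `provide_distributed_model`, `save_hf_pretrained`) are assumptions based on the bridge API described in the project's documentation and may not match the current release exactly.

```python
# Illustrative sketch only; AutoBridge and these method names are assumed,
# not verified against the current Megatron-Bridge release.
from megatron.bridge import AutoBridge  # assumed import path

# Load Hugging Face weights and map them onto a Megatron Core model definition.
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
provider = bridge.to_megatron_provider()
megatron_model = provider.provide_distributed_model(wrap_with_ddp=False)

# ... pretrain or fine-tune here using Megatron Core parallelism ...

# Export the (possibly updated) weights back to Hugging Face format.
bridge.save_hf_pretrained(megatron_model, "./llama-3.2-1b-hf-export")
```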
Quick Start & Requirements
The recommended installation is via the NeMo Framework container (nvcr.io/nvidia/nemo:${TAG}). A Python 3.10+ environment is required, and users must log in to Hugging Face Hub (huggingface-cli login). Training scripts are typically launched with torchrun.
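A hedged quick-start sketch of those steps follows; the container tag stays a placeholder, and the training script path (`examples/pretrain_llama.py`) and GPU count are illustrative assumptions rather than documented entry points.

```bash
# Pull and enter the NeMo Framework container (replace ${TAG} with a released tag).
docker run --gpus all -it --rm nvcr.io/nvidia/nemo:${TAG}

# Authenticate with Hugging Face Hub so gated model weights can be downloaded.
huggingface-cli login

# Launch a training script across 8 GPUs; the script path here is a placeholder.
torchrun --nproc_per_node=8 examples/pretrain_llama.py
```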
Highlighted Details
Maintenance & Community
The project is a continuation of MBridge and has seen adoption by several organizations including veRL, slime, SkyRL, and Nemo-RL. Community contributions are acknowledged.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README, which may impact commercial use or integration decisions.
Limitations & Caveats
The "Supported Models" table indicates that full training/fine-tuning recipes or checkpoint conversion are "Coming soon" for several models, suggesting incomplete support for certain architectures. Installation primarily relies on a Docker container.