NeMo-Framework-Launcher by NVIDIA

Cloud-native tool for launching NeMo framework training jobs

Created 2 years ago
511 stars

Top 61.2% on SourcePulse

View on GitHub
Project Summary

The NeMo Framework Launcher is a cloud-native tool designed for launching end-to-end training pipelines for Large Language Models (LLMs) and multimodal foundation models. It targets researchers and engineers working with generative AI, simplifying the complex process of large-scale model training on diverse compute environments, from on-premises clusters to cloud platforms.

How It Works

The launcher orchestrates the entire LLM training lifecycle, including data preparation, model parallelism configuration, training, fine-tuning (SFT, PEFT), evaluation, and export. It leverages advanced training techniques such as Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Distributed Optimizer, and mixed-precision training (FP8, BF16) to enable efficient scaling to thousands of GPUs for training on trillions of tokens. The tool generates and manages submission scripts for cluster schedulers, organizes job results, and supports custom container images.
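For example, a single launcher run can chain several of these stages through Hydra-style command-line overrides, which is how the generated .yaml configs are selected and overridden. The sketch below is illustrative only: the config group names (cluster, stages, training) and the gpt3/5b model config are assumptions drawn from the launcher's documented Hydra layout and may differ between releases.

    # Hedged sketch: chain data prep, training, evaluation, and export
    # in one submission on a Slurm-based (bcm) cluster. Stage and group
    # names are assumptions; check them against the shipped conf/config.yaml.
    python main.py \
        cluster=bcm \
        stages='[data_preparation,training,evaluation,export]' \
        training=gpt3/5b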

Quick Start & Requirements

  • Install: git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git && cd NeMo-Framework-Launcher && pip install -r requirements.txt
  • Prerequisites: Python; compatible with NeMo version 1.0 only (tested with the NeMo Framework Container).
  • Usage: Configure the .yaml files and run python main.py (a minimal sketch follows this list).
  • Docs: NeMo Launcher Guide, NeMo Framework Playbooks
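A minimal first run might look like the following. This is a hedged sketch: the launcher_scripts directory, the launcher_scripts_path and container keys, and the gpt3/126m config are assumptions based on the Hydra layout described in the NeMo Launcher Guide; verify them against your checkout.

    # Hedged sketch: submit a single training stage for a small model.
    # Keys and config names are assumptions; confirm against your version.
    cd NeMo-Framework-Launcher/launcher_scripts
    python main.py \
        stages='[training]' \
        training=gpt3/126m \
        launcher_scripts_path=$PWD \
        container=<NeMo-Framework-container-image>   # placeholder image reference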

Highlighted Details

  • Supports LLM pretraining and fine-tuning (SFT, PEFT) for models like GPT, BERT, and T5/MT5.
  • Scales training to thousands of GPUs and trillions of tokens.
  • Integrates advanced parallelism techniques (Tensor, Pipeline, Sequence, Distributed Optimizer); a configuration sketch follows this list.
  • Facilitates cluster setup, data management, and model deployment.
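To make the parallelism point concrete: the degrees of tensor, pipeline, and sequence parallelism are typically tuned per model through fields of the training .yaml, which can also be overridden on the command line. The field names below follow NeMo's Megatron-style conventions (tensor_model_parallel_size, pipeline_model_parallel_size, sequence_parallel) but are assumptions to confirm against the shipped configs.

    # Hedged sketch: override parallelism degrees for a larger model.
    # Field names are assumptions based on NeMo config conventions.
    python main.py \
        stages='[training]' \
        training=gpt3/20b \
        training.model.tensor_model_parallel_size=4 \
        training.model.pipeline_model_parallel_size=2 \
        training.model.sequence_parallel=true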

Maintenance & Community

Contributions are accepted via pull requests. Further community engagement details are not specified in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: NeMo version 1.0 only. Designed for cloud-native and on-premises cluster deployment.

Limitations & Caveats

The launcher is strictly compatible with NeMo version 1.0, which may make it unsuitable for users of newer NeMo Framework releases.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

4 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

Framework for training large-scale autoregressive language models
Top 0.2% on SourcePulse · 7k stars · Created 4 years ago · Updated 2 days ago

Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

AI system for large-scale parallel training
Top 0.1% on SourcePulse · 41k stars · Created 3 years ago · Updated 13 hours ago