NeMo-Framework-Launcher by NVIDIA

Cloud-native tool for launching NeMo framework training jobs

created 2 years ago · 508 stars · Top 62.2% on sourcepulse

Project Summary

The NeMo Framework Launcher is a cloud-native tool designed for launching end-to-end training pipelines for Large Language Models (LLMs) and multimodal foundation models. It targets researchers and engineers working with generative AI, simplifying the complex process of large-scale model training on diverse compute environments, from on-premises clusters to cloud platforms.

How It Works

The launcher orchestrates the entire LLM training lifecycle, including data preparation, model parallelism configuration, training, fine-tuning (SFT, PEFT), evaluation, and export. It leverages advanced training techniques such as Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Distributed Optimizer, and mixed-precision training (FP8, BF16) to enable efficient scaling to thousands of GPUs for training on trillions of tokens. The tool generates and manages submission scripts for cluster schedulers, organizes job results, and supports custom container images.
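
The launcher is configuration-driven: jobs are described in YAML files, and Hydra-style command-line overrides select the pipeline stage, cluster backend, and parallelism degrees. The sketch below is a hedged illustration, not verbatim usage; the stage name and config keys (training=gpt3/5b, tensor_model_parallel_size, and so on) are assumptions that may differ across launcher versions.

    # Hypothetical sketch: submit a GPT pretraining stage to a Slurm-managed
    # cluster, overriding node count and parallelism from the command line.
    # All stage/config names are assumptions; check the shipped YAML configs
    # for the keys your release actually uses.
    python main.py \
        stages=[training] \
        training=gpt3/5b \
        cluster=bcm \
        training.trainer.num_nodes=4 \
        training.model.tensor_model_parallel_size=2 \
        training.model.pipeline_model_parallel_size=1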

Quick Start & Requirements

  • Install: git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git && cd NeMo-Framework-Launcher && pip install -r requirements.txt
  • Prerequisites: Python; compatible with NeMo version 1.0 only. Tested with the NeMo Framework Container.
  • Usage: configure the .yaml files and run python main.py (see the sketch after this list).
  • Docs: NeMo Launcher Guide, NeMo Framework Playbooks
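
Put together, a first run might look like the following sketch. The override keys and stage list are assumptions based on the launcher's config-driven design, so treat this as a template rather than exact commands.

    # Hypothetical quick-start sketch; keys and values are assumptions.
    git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git
    cd NeMo-Framework-Launcher
    pip install -r requirements.txt
    # Edit the YAML configs (cluster type, container image, data and results
    # paths), then launch the configured stages:
    python main.py cluster=bcm stages=[data_preparation,training]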

Highlighted Details

  • Supports LLM pretraining and fine-tuning (SFT, PEFT) for models like GPT, BERT, and T5/MT5 (see the fine-tuning sketch after this list).
  • Scales training to thousands of GPUs and trillions of tokens.
  • Integrates advanced parallelism techniques (Tensor, Pipeline, Sequence, Distributed Optimizer).
  • Facilitates cluster setup, data management, and model deployment.
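
As an example of the fine-tuning support listed above, a PEFT run reuses the same entry point with a different stage selected. This is a hedged sketch: the config group (peft=llama/squad), scheme key, and paths are assumptions and will vary by launcher version and model family.

    # Hypothetical PEFT (LoRA) fine-tuning sketch; every name below is an
    # assumption, not a confirmed config key from this page.
    python main.py \
        stages=[peft] \
        peft=llama/squad \
        peft.model.peft.peft_scheme=lora \
        peft.model.restore_from_path=/path/to/base_model.nemo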

Maintenance & Community

Contributions are accepted via pull requests. Further community engagement details are not specified in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: NeMo version 1.0 only. Designed for cloud-native and on-premises cluster deployment.

Limitations & Caveats

The launcher supports NeMo version 1.0 only, which may limit its applicability for users of newer NeMo Framework releases.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine
Top 0.1% · 839 stars
PyTorch-native framework for LLM training
created 1 year ago · updated 3 weeks ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
Top 1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 8 more.

higgsfield by higgsfield-ai
Top 0.3% · 3k stars
ML framework for large model training and GPU orchestration
created 7 years ago · updated 1 year ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 6 more.

gpt-neox by EleutherAI
Top 0.1% · 7k stars
Framework for training large-scale autoregressive language models
created 4 years ago · updated 1 week ago