varuna by microsoft

Tool for efficient large DNN model training on commodity hardware

created 4 years ago
251 stars

Top 99.8% on sourcepulse

Project Summary

Varuna is a PyTorch library designed for efficient, scalable, and cost-effective training of large deep learning models on commodity hardware. It targets researchers and practitioners working with massive models that exceed the memory capacity of single GPUs, offering a solution that combines pipeline and data parallelism with dynamic resource adaptation.

How It Works

Varuna implements a hybrid parallelism strategy that combines pipeline parallelism (PP) with data parallelism (DP). Models are partitioned into sequential stages using CutPoint annotations within the model definition (sketched below), and these stages are distributed across the available GPUs; data parallelism is then applied across replicas of the pipeline. Breaking a large model into distributed stages makes efficient use of memory and compute, while the hybrid PP/DP layout balances communication against computation overheads.
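
The annotation pattern looks roughly like the following; this is a minimal sketch based on the usage shown in the project README, with an illustrative model structure and layer sizes that are not taken from the repo:

    import torch.nn as nn
    from varuna import CutPoint

    class MyModel(nn.Module):
        def __init__(self, hidden=1024, num_layers=4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Linear(hidden, hidden) for _ in range(num_layers))
            # One CutPoint between consecutive layers: each marks a
            # *potential* stage boundary, and Varuna decides at
            # partition time which ones become actual pipeline splits.
            self.cutpoints = nn.ModuleList(
                CutPoint() for _ in range(num_layers - 1))

        def forward(self, x):
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i < len(self.cutpoints):
                    # Behaves as an identity op unless this CutPoint
                    # is chosen as a stage boundary.
                    x = self.cutpoints[i](x)
            return x

The annotated model is then wrapped by the library's Varuna class, which performs the actual partitioning and pipeline scheduling; see the docs/ folder for the constructor's exact arguments.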

Quick Start & Requirements

  • Installation: Requires Python 3, PyTorch (1.5+), and Apex. Apex must be patched using the provided apex.patch before building.
    git clone https://github.com/NVIDIA/apex
    cp apex.patch /path/to/apex/
    cd /path/to/apex
    git apply apex.patch
    pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    
    Then, install Varuna:
    git clone <varuna_repo>
    cd varuna
    python setup.py install
    
  • Prerequisites: PyTorch, Apex (with patch), Python 3.
  • Launch: Use run_varuna.py for distributed execution; a representative invocation follows this list.
  • Docs: Available in the docs/ folder (html/index.html, varuna.pdf). Examples for BERT and Megatron-LM are in examples/.
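
A launch could look like the line below. The flag names follow the project README, but the values are placeholders; verify both against python -m varuna.run_varuna --help for your version:

    python -m varuna.run_varuna --machine_list ip_list.txt --gpus_per_node 4 \
        --batch_size 1024 --nstages 4 --chunk_size 32 \
        --code_dir /path/to/train_dir my_training_script.py <script args>

Here --nstages is the number of pipeline stages, --chunk_size the micro-batch size per pipeline chunk, and --batch_size the total effective batch size across data-parallel replicas.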

Highlighted Details

  • Implements pipeline parallelism and data parallelism for large model training.
  • Supports dynamic resource scaling ("job morphing") via signal handling for checkpointing and relaunching (see the sketch after this list).
  • Includes an auto-configuration module that profiles model/network performance to suggest optimal parallelism settings.
  • Handles FP16 mixed-precision training and parameter sharing across stages.
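
For the job-morphing hook, a training script installs a handler that checkpoints and exits cleanly so the launcher can restart the job with a new parallel configuration. A minimal sketch follows; the choice of SIGUSR1 and the save_checkpoint helper are assumptions for illustration, so consult the Varuna docs for the exact signal and checkpoint API:

    import signal
    import sys

    def save_checkpoint(path):
        # Hypothetical helper: persist model/optimizer state to `path`
        # so training can resume after relaunch.
        ...

    def handle_morph(signum, frame):
        # On a resource-change notification, save state and exit;
        # the launcher relaunches on the new set of machines.
        save_checkpoint("/tmp/varuna_ckpt")  # hypothetical path
        sys.exit(0)

    signal.signal(signal.SIGUSR1, handle_morph)
    # ... the training loop runs normally; the handler fires asynchronously.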

Maintenance & Community

  • Based on the paper "Varuna: Scalable, Low-cost Training of Massive Deep Learning Models" (EuroSys'22).
  • No explicit community links (Discord/Slack) or active contributor information are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The Apex dependency is BSD 3-Clause licensed, but Varuna's own license requires verification.

Limitations & Caveats

  • Requires manual annotation of models with CutPoint instances.
  • The setup process for Apex patching can be fragile and version-dependent.
  • Job morphing requires user-implemented signal handlers in training scripts.
  • Auto-configuration requires a separate profiling step.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 2 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

Explore Similar Projects

veScale by volcengine

Top 0.1% on sourcepulse
839 stars
PyTorch-native framework for LLM training
created 1 year ago
updated 3 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Top 0.1% on sourcepulse
806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago
Starred by Yang Song (Professor at Caltech; Research Scientist at OpenAI), Jeremy Howard (Cofounder of fast.ai), and 4 more.

PiPPy by pytorch

Top 0.1% on sourcepulse
775 stars
PyTorch tool for pipeline parallelism
created 3 years ago
updated 11 months ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% on sourcepulse
5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

Top 0.9% on sourcepulse
4k stars
PyTorch platform for generative AI model training research
created 1 year ago
updated 1 day ago