varuna by microsoft

Tool for efficient large DNN model training on commodity hardware

Created 4 years ago

252 stars

Top 99.6% on SourcePulse

View on GitHub

3 Experts Love This Project

Vincent Weisser

Cofounder of Prime Intellect

Wing Lian

Founder of Axolotl AI

Amanpreet Singh

Cofounder of Contextual AI

Project Summary

Varuna is a PyTorch library designed for efficient, scalable, and cost-effective training of large deep learning models on commodity hardware. It targets researchers and practitioners working with massive models that exceed the memory capacity of single GPUs, offering a solution that combines pipeline and data parallelism with dynamic resource adaptation.

How It Works

Varuna implements a hybrid parallelism strategy, interleaving pipeline parallelism (PP) and data parallelism (DP). Models are partitioned into sequential stages using CutPoint annotations within the model definition. These stages are then distributed across available GPUs. Data parallelism is applied across replicas of this pipeline. This approach allows for efficient utilization of memory and compute by breaking down large models and distributing them, while the hybrid nature aims to balance communication and computation overheads.

Quick Start & Requirements

Installation: Requires Python 3, PyTorch (1.5+), and Apex. Apex must be patched using the provided apex.patch before building.

git clone https://github.com/NVIDIA/apex
cp apex.patch /path/to/apex/
cd /path/to/apex
git apply apex.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Then, install Varuna:

git clone <varuna_repo>
cd varuna
python setup.py install

Prerequisites: PyTorch, Apex (with patch), Python 3.
Launch: Use run_varuna.py for distributed execution.
Docs: Available in the docs/ folder (html/index.html, varuna.pdf). Examples for BERT and Megatron-LM are in examples/.

Highlighted Details

Implements pipeline parallelism and data parallelism for large model training.
Supports dynamic resource scaling ("job morphing") via signal handling for checkpointing and relaunching.
Includes an auto-configuration module that profiles model/network performance to suggest optimal parallelism settings.
Handles FP16 mixed-precision training and parameter sharing across stages.

Maintenance & Community

Based on the paper "Varuna: Scalable, Low-cost Training of Massive Deep Learning Models" (EuroSys'22).
No explicit community links (Discord/Slack) or active contributor information provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The dependency Apex is typically under a permissive license, but Varuna's own license requires verification.

Limitations & Caveats