veScale by volcengine

PyTorch-native framework for LLM training

Created 1 year ago
865 stars

Top 41.5% on SourcePulse

Project Summary

veScale is a PyTorch-native framework designed to simplify and accelerate large language model (LLM) training. It targets researchers and engineers, abstracting away complex distributed training strategies so that single-device model code can train across a cluster with near-zero modification, while parallelism is planned automatically for performance.

How It Works

veScale leverages PyTorch's native ecosystem, offering single-device semantics that automatically distribute and orchestrate model execution across a cluster. It aims to provide automatic parallelism planning, combining strategies like tensor, sequence, data, ZeRO, and pipeline parallelism. The framework also supports both eager and compile modes for training and inference, with automatic management and resharding of distributed checkpoints across varying cluster sizes and parallelism configurations.
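
The exact veScale API is not shown in this summary, so as a rough illustration of the single-device-semantics idea it builds on, here is a minimal sketch using PyTorch's native DTensor (available in recent PyTorch releases); the mesh size and sharding choice are assumptions:

    # Illustration only: this uses PyTorch's built-in DTensor API, not
    # veScale's own entry points, to show "write single-device code,
    # run distributed". Launch with: torchrun --nproc_per_node=4 sketch.py
    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard

    mesh = init_device_mesh("cuda", (4,))  # assumes 4 GPUs on one node

    # Ordinary single-device code ...
    weight = torch.randn(1024, 1024)

    # ... sharded across the mesh on dim 0; subsequent ops keep
    # single-device semantics while running on per-rank shards.
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])
    print(dweight.shape)             # global shape: torch.Size([1024, 1024])
    print(dweight.to_local().shape)  # local shard:  torch.Size([256, 1024])

    dist.destroy_process_group()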

Quick Start & Requirements

  • Install: pip install vescale (a post-install smoke test is sketched after this list)
  • Prerequisites: PyTorch. The README does not pin hardware requirements (GPU models, CUDA versions), though GPUs are implied for LLM training.
  • Resources: LLM training typically requires substantial GPU memory and compute.
  • Links: veScale GitHub Repository (https://github.com/volcengine/veScale)
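
A minimal post-install smoke test; the module name vescale is an assumption based on the pip package name, and nothing here is confirmed by the README beyond the install command:

    # Smoke test: confirm the install resolves and PyTorch is importable.
    # `vescale` as the import name is an assumption from the pip package name.
    import torch
    import vescale

    print("torch:", torch.__version__)
    print("vescale loaded from:", vescale.__file__)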

Highlighted Details

  • Open-sourced pipeline parallelism API, graph parser, stage abstraction, schedules, and execution runtime.
  • Fast checkpointing system with automatic resharding, caching, load-balancing, and asynchronous I/O (see the save/load sketch after this list).
  • Examples for Mixtral, Llama2, and nanoGPT demonstrating bit-wise correctness.
  • Presented at MLSys 2024 and NSDI 2024.
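
A rough sketch of the checkpoint round trip mentioned above; the vescale.checkpoint save/load entry points and their signatures are assumptions modeled on common distributed-checkpoint APIs, and the plain Linear model stands in for a parallelized one:

    # Hypothetical sketch: vescale.checkpoint.save/load and their call
    # shapes are assumptions, not confirmed by this summary.
    import torch
    import vescale  # assumed import name

    model = torch.nn.Linear(1024, 1024)  # stand-in for a parallelized model
    optimizer = torch.optim.AdamW(model.parameters())
    checkpoint_state = {"model": model, "optimizer": optimizer}

    # Save: each rank writes its shards; I/O may run asynchronously.
    vescale.checkpoint.save("/tmp/vescale_ckpt/step_1000", checkpoint_state)

    # Load: can target a different cluster size or parallelism layout;
    # the automatic resharding described above remaps old shards onto it.
    vescale.checkpoint.load("/tmp/vescale_ckpt/step_1000", checkpoint_state)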

Maintenance & Community

The project is under active development, with key components recently open-sourced. Planned work includes high-level and power-user APIs for nD parallel training. The README also includes hiring notices.

Licensing & Compatibility

  • License: Apache License v2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

veScale is described as being in its early phase, with ongoing refactoring to meet open-source standards. Full automation for parallelism planning and compile-mode support are listed as "coming soon."

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4%
1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago
Starred by Yang Song (Professor at Caltech; Research Scientist at OpenAI), Jeremy Howard (Cofounder of fast.ai), and 6 more.

PiPPy by pytorch

0%
779 stars
PyTorch tool for pipeline parallelism
Created 3 years ago · Updated 1 year ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k stars
LLM research codebase for training and inference
Created 11 months ago · Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 21 hours ago