veScale by volcengine

PyTorch-native framework for LLM training

Created 1 year ago
865 stars

Top 41.5% on SourcePulse

Project Summary

veScale is a PyTorch-native framework designed to simplify and accelerate large language model (LLM) training. It targets researchers and engineers, abstracting away complex distributed training strategies so that single-device model code can train across a cluster with near-zero modification, while parallelism is planned automatically for performance.

How It Works

veScale leverages PyTorch's native ecosystem, offering single-device semantics that automatically distribute and orchestrate model execution across a cluster. It aims to provide automatic parallelism planning, combining strategies like tensor, sequence, data, ZeRO, and pipeline parallelism. The framework also supports both eager and compile modes for training and inference, with automatic management and resharding of distributed checkpoints across varying cluster sizes and parallelism configurations.
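
The exact veScale API is not shown in this summary, so as a rough illustration of the single-device-semantics idea it builds on, here is a minimal sketch using PyTorch's native DTensor (available in recent PyTorch releases); the mesh size and sharding choice are assumptions:

    # Illustration only: this uses PyTorch's built-in DTensor API, not
    # veScale's own entry points, to show "write single-device code,
    # run distributed". Launch with: torchrun --nproc_per_node=4 sketch.py
    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard

    mesh = init_device_mesh("cuda", (4,))  # assumes 4 GPUs on one node

    # Ordinary single-device code ...
    weight = torch.randn(1024, 1024)

    # ... sharded across the mesh on dim 0; subsequent ops keep
    # single-device semantics while running on per-rank shards.
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])
    print(dweight.shape)             # global shape: torch.Size([1024, 1024])
    print(dweight.to_local().shape)  # local shard:  torch.Size([256, 1024])

    dist.destroy_process_group()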

Quick Start & Requirements

  • Install: pip install vescale (a post-install smoke test is sketched after this list)
  • Prerequisites: PyTorch. The README does not pin hardware requirements (GPU models, CUDA versions), though GPUs are implied for LLM training.
  • Resources: LLM training typically requires substantial GPU memory and compute.
  • Links: veScale GitHub Repository (https://github.com/volcengine/veScale)
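
A minimal post-install smoke test; the module name vescale is an assumption based on the pip package name, and nothing here is confirmed by the README beyond the install command:

    # Smoke test: confirm the install resolves and PyTorch is importable.
    # `vescale` as the import name is an assumption from the pip package name.
    import torch
    import vescale

    print("torch:", torch.__version__)
    print("vescale loaded from:", vescale.__file__)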

Highlighted Details

  • Open-sourced pipeline parallelism API, graph parser, stage abstraction, schedules, and execution runtime.
  • Fast checkpointing system with automatic resharding, caching, load-balancing, and asynchronous I/O (see the save/load sketch after this list).
  • Examples for Mixtral, Llama2, and nanoGPT demonstrating bit-wise correctness.
  • Presented at MLSys 2024 and NSDI 2024.
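
A rough sketch of the checkpoint round trip mentioned above; the vescale.checkpoint save/load entry points and their signatures are assumptions modeled on common distributed-checkpoint APIs, and the plain Linear model stands in for a parallelized one:

    # Hypothetical sketch: vescale.checkpoint.save/load and their call
    # shapes are assumptions, not confirmed by this summary.
    import torch
    import vescale  # assumed import name

    model = torch.nn.Linear(1024, 1024)  # stand-in for a parallelized model
    optimizer = torch.optim.AdamW(model.parameters())
    checkpoint_state = {"model": model, "optimizer": optimizer}

    # Save: each rank writes its shards; I/O may run asynchronously.
    vescale.checkpoint.save("/tmp/vescale_ckpt/step_1000", checkpoint_state)

    # Load: can target a different cluster size or parallelism layout;
    # the automatic resharding described above remaps old shards onto it.
    vescale.checkpoint.load("/tmp/vescale_ckpt/step_1000", checkpoint_state)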

Maintenance & Community

The project is under active development, with key components recently open-sourced. Planned work includes high-level and power-user APIs for nD parallel training. The README also includes hiring notices.

Licensing & Compatibility

  • License: Apache License v2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

veScale is described as being in its early phase, with ongoing refactoring to meet open-source standards. Full automation for parallelism planning and compile-mode support are listed as "coming soon."

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4%
1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago
Starred by Yang Song (Professor at Caltech; Research Scientist at OpenAI), Jeremy Howard (Cofounder of fast.ai), and 6 more.

PiPPy by pytorch

0%
779 stars
PyTorch tool for pipeline parallelism
Created 3 years ago · Updated 1 year ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k stars
LLM research codebase for training and inference
Created 11 months ago · Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 21 hours ago