veScale by volcengine

PyTorch-native framework for LLM training

  • created 1 year ago
  • 839 stars
  • Top 43.3% on sourcepulse

Project Summary

veScale is a PyTorch-native framework designed to simplify and accelerate large language model (LLM) training. It targets researchers and engineers by abstracting complex distributed training strategies, enabling near-zero code modification for users and automating parallelism for enhanced performance.

How It Works

veScale leverages PyTorch's native ecosystem, offering single-device semantics that automatically distribute and orchestrate model execution across a cluster. It aims to provide automatic parallelism planning, combining strategies like tensor, sequence, data, ZeRO, and pipeline parallelism. The framework also supports both eager and compile modes for training and inference, with automatic management and resharding of distributed checkpoints across varying cluster sizes and parallelism configurations.
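To make the idea of single-device semantics concrete, here is a minimal numpy sketch of column-wise tensor parallelism, the kind of strategy veScale automates. This is an illustration of the underlying math only, not veScale's API: the weight shards and the simulated "devices" are assumptions for the example.

```python
import numpy as np

# Sketch of column-wise tensor parallelism (illustration, not veScale's API):
# a linear layer's weight is split across simulated devices, each computes a
# partial output, and the shards are concatenated to match the result the
# user's single-device code would produce.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of activations
W = rng.standard_normal((8, 6))      # full weight matrix

# Single-device reference result.
y_ref = x @ W

# "Shard" the weight column-wise across 2 simulated devices.
W_shards = np.split(W, 2, axis=1)    # two (8, 3) shards

# Each device computes a local matmul on its shard.
y_parts = [x @ w for w in W_shards]

# Gather: concatenate the partial outputs into the full result.
y_tp = np.concatenate(y_parts, axis=1)

assert np.allclose(y_ref, y_tp)
```

A framework offering single-device semantics performs this sharding, local compute, and gathering automatically, so the user's model code never references shards or device ranks.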

Quick Start & Requirements

  • Install: pip install vescale
  • Prerequisites: PyTorch. Specific hardware requirements (e.g., GPUs, CUDA versions) are not detailed in the README but are implied for LLM training.
  • Resources: LLM training typically requires significant GPU memory and compute.
  • Links: veScale GitHub Repository

Highlighted Details

  • Open-sourced pipeline parallelism API, graph parser, stage abstraction, schedules, and execution runtime.
  • Fast checkpointing system with automatic resharding, caching, load-balancing, and asynchronous I/O.
  • Examples for Mixtral, Llama2, and nanoGPT demonstrating bit-wise correctness.
  • Presented at MLSys 2024 and NSDI 2024.
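The automatic-resharding idea from the checkpointing bullet can be sketched in a few lines. This toy example (plain Python and numpy, not veScale's checkpoint format; the `reshard` helper and rank-keyed layout are assumptions for illustration) regroups a parameter saved as 4 row-shards so it loads onto 2 ranks:

```python
import numpy as np

# Toy sketch of checkpoint resharding (not veScale's checkpoint format):
# a parameter saved as 4 row-shards (world size 4) is reloaded onto 2 ranks
# by regrouping shards -- the kind of transformation an auto-resharding
# checkpoint system performs when the cluster size changes.

full = np.arange(16.0).reshape(8, 2)

# Save: shard the parameter row-wise across 4 ranks.
ckpt = {f"rank{r}": s for r, s in enumerate(np.split(full, 4, axis=0))}

def reshard(ckpt, old_world, new_world):
    """Regroup row-shards from old_world ranks into new_world ranks."""
    shards = [ckpt[f"rank{r}"] for r in range(old_world)]
    per_rank = old_world // new_world
    return [np.concatenate(shards[i * per_rank:(i + 1) * per_rank], axis=0)
            for i in range(new_world)]

# Load onto 2 ranks: each new rank stitches together two old shards.
new_shards = reshard(ckpt, old_world=4, new_world=2)
assert np.allclose(np.concatenate(new_shards, axis=0), full)
```

A production system must additionally handle uneven shard boundaries, multiple parallelism dimensions, and asynchronous I/O, which is where the caching and load-balancing mentioned above come in.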

Maintenance & Community

The project is under active development, with key components recently open-sourced. Planned work includes high-level and power-user APIs for nD parallel training. The README also carries hiring notices.

Licensing & Compatibility

  • License: Apache License v2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

veScale is described as being in its early phase, with ongoing refactoring to meet open-source standards. Full automation for parallelism planning and compile-mode support are listed as "coming soon."

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 45 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
  • 1.0% · 402 stars
  • Lightweight training framework for model pre-training
  • created 1 year ago · updated 1 week ago

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch
  • 0.1% · 5k stars
  • LLM research codebase for training and inference
  • created 9 months ago · updated 2 weeks ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
  • 0.9% · 4k stars
  • PyTorch platform for generative AI model training research
  • created 1 year ago · updated 22 hours ago

Starred by Lewis Tunstall (Researcher at Hugging Face), Robert Nishihara (Cofounder of Anyscale; Author of Ray), and 4 more.

verl by volcengine
  • 2.4% · 12k stars
  • RL training library for LLMs
  • created 9 months ago · updated 14 hours ago

Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Anton Bukov (Cofounder of 1inch Network), and 16 more.

tinygrad by tinygrad
  • 0.1% · 30k stars
  • Minimalist deep learning framework for education and exploration
  • created 4 years ago · updated 18 hours ago