PyTorch-native framework for LLM training
veScale is a PyTorch-native framework designed to simplify and accelerate large language model (LLM) training. It targets researchers and engineers: complex distributed training strategies are abstracted away so that single-device model code runs with near-zero modification, while parallelization is planned automatically for performance.
How It Works
veScale builds on PyTorch's native ecosystem, offering single-device semantics: a model written as if for one device is automatically distributed and orchestrated across a cluster. It aims to provide automatic parallelism planning that combines tensor, sequence, data, ZeRO, and pipeline parallelism. The framework targets both eager and compile modes for training and inference, and automatically manages and reshards distributed checkpoints across varying cluster sizes and parallelism configurations.
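As a concrete picture of the single-device-semantics idea, here is a minimal sketch using PyTorch's upstream DeviceMesh and tensor-parallel APIs (the PyTorch-native machinery this kind of framework builds on), not veScale's own interface, which may differ. It assumes a recent PyTorch, two GPUs, and a `torchrun` launch; the model class and file name are illustrative.

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    """Ordinary single-device model code; nothing here mentions distribution."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

# One process per GPU, e.g.: torchrun --nproc_per_node=2 tp_sketch.py
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (2,))  # the tensor-parallel group

model = MLP().cuda()
# The plan below shards the weights across the mesh (column-wise, then
# row-wise) when the module is wrapped; the forward pass stays unchanged.
model = parallelize_module(
    model,
    mesh,
    {"up": ColwiseParallel(), "down": RowwiseParallel()},
)

x = torch.randn(8, 1024, device="cuda")
y = model(x)        # executes as a sharded, multi-GPU computation
y.sum().backward()  # gradients are sharded the same way as the parameters
```

The point of the sketch is that only the parallelization plan refers to the device mesh; the model definition and training step read exactly like single-device code.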
Quick Start & Requirements
```bash
pip install vescale
```
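The checkpoint resharding described above can be pictured with PyTorch's built-in `torch.distributed.checkpoint` module. This is a minimal sketch under stated assumptions (a recent PyTorch, a `torchrun` launch, an illustrative checkpoint path), not veScale's own checkpoint interface, which may differ.

```python
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.nn as nn

# Launch with: torchrun --nproc_per_node=2 ckpt_sketch.py
dist.init_process_group(backend="gloo")  # gloo so the sketch also runs on CPU
model = nn.Linear(16, 16)

# Each rank writes only its portion of the (possibly sharded) state.
state = {"model": model.state_dict()}
dcp.save(state, checkpoint_id="ckpt/step_100")

# On a later run -- potentially with a different world size or parallel layout --
# load() fills the provided state dict in place; sharded tensors are resharded
# to match the new placements before being copied in.
dcp.load(state, checkpoint_id="ckpt/step_100")
model.load_state_dict(state["model"])
dist.destroy_process_group()
```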
Highlighted Details
Maintenance & Community
The project is under active development, with key components recently open-sourced. Future plans include high-level and power-user APIs for nD parallel training. The repository also carries hiring notices.
Licensing & Compatibility
Limitations & Caveats
veScale is described as being in its early phase, with ongoing refactoring to meet open-source standards. Full automation for parallelism planning and compile-mode support are listed as "coming soon."