PatrickStar by Tencent

Parallel training framework for large language models

created 4 years ago
763 stars

Top 46.5% on sourcepulse

Project Summary

PatrickStar addresses the prohibitive hardware requirements of training large-scale pre-trained language models (PTMs). It lets researchers and engineers train larger models on fewer GPUs by using CPU and GPU memory together efficiently.

How It Works

PatrickStar manages model data (parameters, gradients, and optimizer states) in fixed-size chunks and moves those chunks between CPU and GPU memory dynamically during training. Unlike static partitioning schemes, it offloads chunks that the current operator does not need to CPU memory, keeping scarce GPU memory free for computation. Chunks also serve as the unit of collective communication, which makes multi-GPU scaling more bandwidth-efficient.
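
The idea can be pictured with a small sketch. This is purely illustrative (Chunk, ChunkManager, and gpu_budget_bytes are invented names for explanation, not PatrickStar's real classes): parameters live in fixed-size flat buffers, and whole buffers migrate between pinned CPU memory and the GPU as operators need them.

    # Purely illustrative sketch, not PatrickStar's actual internals.
    import torch

    class Chunk:
        """Fixed-size flat buffer holding a group of parameters."""
        def __init__(self, numel, dtype=torch.float16):
            # Chunks start on CPU in pinned memory so host-to-device
            # copies can run asynchronously.
            self.payload = torch.empty(numel, dtype=dtype, pin_memory=True)

        def nbytes(self):
            return self.payload.numel() * self.payload.element_size()

        def to_gpu(self):
            if not self.payload.is_cuda:
                self.payload = self.payload.to("cuda", non_blocking=True)

        def to_cpu(self):
            if self.payload.is_cuda:
                self.payload = self.payload.cpu().pin_memory()

    class ChunkManager:
        """Keeps hot chunks on GPU; spills cold ones to CPU under pressure."""
        def __init__(self, gpu_budget_bytes):
            self.gpu_budget = gpu_budget_bytes
            self.resident = []  # chunks on GPU, least recently used first

        def fetch(self, chunk):
            # Before an operator runs, bring its chunk to GPU, evicting
            # least-recently-used chunks until the new one fits.
            if chunk in self.resident:
                self.resident.remove(chunk)
            while self.resident and self._used() + chunk.nbytes() > self.gpu_budget:
                self.resident.pop(0).to_cpu()
            chunk.to_gpu()
            self.resident.append(chunk)

        def _used(self):
            return sum(c.nbytes() for c in self.resident)

Moving data at chunk granularity amortizes transfer overhead and gives the runtime a natural unit for bucketing collective communication across GPUs.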

Quick Start & Requirements

  • Install from source via pip install . (run from the repository root).
  • Requires gcc version 7 or higher.
  • Tested NVIDIA NGC image: nvcr.io/nvidia/pytorch:21.06-py3
  • Official quick-start and examples are available; a minimal usage sketch follows below.
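
The README documents a DeepSpeed-style engine API. The sketch below is lightly adapted from the project's example; exact config keys and signatures may vary between versions, and MyModel and dataloader are placeholders you must supply.

    # Adapted from PatrickStar's README example; details may differ by version.
    from patrickstar.runtime import initialize_engine

    config = {
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-6,
                       "weight_decay": 0, "use_hybrid_adam": True},
        },
        "fp16": {  # loss-scaler settings for mixed precision
            "enabled": True, "loss_scale": 0, "initial_scale_power": 2 ** 3,
            "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1,
        },
        "default_chunk_size": 64 * 1024 * 1024,  # elements per memory chunk
    }

    def model_func():
        # Return a torch.nn.Module; the engine takes over parameter placement.
        return MyModel()  # MyModel is a placeholder

    model, optimizer = initialize_engine(
        model_func=model_func, local_rank=0, config=config
    )

    for batch in dataloader:  # dataloader is assumed to exist
        optimizer.zero_grad()
        loss = model(batch)
        model.backward(loss)  # backward runs through the engine, not loss.backward()
        optimizer.step()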

Highlighted Details

  • Enables training of an 18B-parameter model on 8x V100 GPUs with 240GB total GPU memory (see the accounting sketch after this list).
  • Trains a 68B-parameter model on 8x A100 GPUs with 1TB of CPU memory.
  • Successfully trained a GPT3-175B model on 32 GPUs.
  • Reported to outperform DeepSpeed at model sizes both systems can run.
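
A back-of-envelope check (an estimate, not a figure from the README) shows why the 18B case cannot fit in GPU memory alone: standard mixed-precision Adam keeps roughly 16 bytes of model state per parameter.

    # Rough accounting per parameter: fp16 param (2B) + fp16 grad (2B)
    # + fp32 master param (4B) + fp32 momentum (4B) + fp32 variance (4B) = 16B.
    params = 18e9
    state_bytes = params * (2 + 2 + 4 + 4 + 4)
    print(f"{state_bytes / 1024**3:.0f} GiB")  # ~268 GiB of model state

That already exceeds the 240GB of aggregate GPU memory before counting activations, which is why chunks must spill to CPU memory.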

Maintenance & Community

  • Developed by the WeChat AI Team, Tencent NLP Oteam.
  • Contact: {jiaruifang, zilinzhu, josephyu}@tencent.com

Licensing & Compatibility

  • BSD 3-Clause License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README pins specific DeepSpeed and PyTorch versions for its benchmarks, so compatibility with newer releases is unverified. The primary installation path is building from source, which takes more effort than installing a pre-built package.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

Explore Similar Projects

InternEvo by InternLM

Lightweight training framework for model pre-training

created 1 year ago · updated 1 week ago
402 stars

Top 1.0% on sourcepulse