tensor_parallel by BlackSamorez

PyTorch module for multi-GPU model parallelism

Created 2 years ago
657 stars

Top 50.9% on SourcePulse

View on GitHub
Project Summary

This library enables users to effortlessly distribute PyTorch models across multiple GPUs for training and inference with minimal code changes. It's designed for researchers and practitioners working with large language models that exceed single-GPU memory capacity, offering a straightforward solution for scaling.

How It Works

The core of the library is the tp.tensor_parallel function, which automatically partitions model weights across the specified GPUs. It implements tensor parallelism by splitting individual layer weights, running each GPU's share of the computation locally, and synchronizing the results. Because the model's memory footprint is split across devices, memory savings scale with the number of GPUs and speedups are close to linear. The library also supports ZeRO-3 sharding for trainable parameters not covered by tensor parallelism, further reducing memory usage during training.
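
A minimal inference sketch following the usage pattern described in the project README; the model name and device list below are placeholders, and exact keyword arguments may differ between versions:

    import torch
    import transformers
    import tensor_parallel as tp

    # Placeholder model; any Hugging Face transformers causal LM should work.
    tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
    model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")

    # Single call that splits each layer's weights across the listed GPUs.
    model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

    inputs = tokenizer("A cat sat on a mat", return_tensors="pt")["input_ids"].to("cuda:0")
    with torch.no_grad():
        outputs = model.generate(inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))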

Quick Start & Requirements

  • Install: pip install tensor_parallel
  • Requirements: PyTorch, transformers. Multi-GPU setup recommended.
  • Demo: Kaggle notebook available for a 40B LLM.
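
Because the wrapped model behaves like a regular nn.Module, training uses the usual forward/backward/step loop. A hedged sketch (the sharded=True flag enabling ZeRO-3 for parameters not split by tensor parallelism is an assumption based on the README's description; the model and optimizer are placeholders):

    import torch
    import transformers
    import tensor_parallel as tp

    model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
    # Assumption: sharded=True applies ZeRO-3 sharding to parameters that
    # tensor parallelism leaves intact; consult the README for the exact flag.
    model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"], sharded=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    input_ids = torch.randint(0, 50257, (2, 128)).to("cuda:0")
    outputs = model(input_ids=input_ids, labels=input_ids)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()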

Highlighted Details

  • Enables running large PyTorch models on multiple GPUs with a single line of code.
  • Supports both training and inference.
  • Offers memory-efficient dispatch for loading models using convert_state_dict and accelerate.
  • Includes a save_tensor_parallel context manager for saving models back to a non-parallel format (see the sketch after this list).
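
A hedged sketch of exporting a tensor-parallel model back to an ordinary checkpoint with the save_tensor_parallel context manager (here `model` is assumed to be a module already wrapped by tp.tensor_parallel):

    import torch
    import tensor_parallel as tp

    # Inside the context, state_dict() is assumed to return weights in the
    # original, non-parallel layout, so the file loads without tensor_parallel.
    with tp.save_tensor_parallel(model):
        torch.save(model.state_dict(), "model.pt")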

Maintenance & Community

The project is maintained by BlackSamorez, though recent activity has been low (see the Health Check below). Users can report bugs and issues via the GitHub issue tracker.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library is primarily designed for quick prototyping on a single machine. For large-scale, multi-node training, more complex solutions like DeepSpeed or Megatron-LM are recommended. Debugging NCCL errors may require setting TENSOR_PARALLEL_USE_NATIVE=1.
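
One hedged way to enable that fallback when diagnosing NCCL problems is to set the environment variable before tensor_parallel is imported (the behavior is assumed from the caveat above; setting it in the shell before launching the script works as well):

    import os

    # Assumption: this selects tensor_parallel's native (non-NCCL) backend,
    # which is slower but easier to debug when NCCL crashes or hangs.
    os.environ["TENSOR_PARALLEL_USE_NATIVE"] = "1"

    import tensor_parallel as tp  # import after setting the variable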

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Amanpreet Singh (Cofounder of Contextual AI) and Ross Taylor (Cofounder of General Reasoning; Co-creator of Papers with Code).

torchshard by kaiyuyue

300 stars
PyTorch engine for tensor slicing into parallel shards
Created 4 years ago
Updated 3 months ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

6k stars
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago