tensor_parallel by BlackSamorez

PyTorch module for multi-GPU model parallelism

Created 2 years ago
657 stars

Top 50.9% on SourcePulse

View on GitHub
Project Summary

This library enables users to effortlessly distribute PyTorch models across multiple GPUs for training and inference with minimal code changes. It's designed for researchers and practitioners working with large language models that exceed single-GPU memory capacity, offering a straightforward solution for scaling.

How It Works

The core of the library is the tp.tensor_parallel function, which automatically partitions model weights across the specified GPUs. It implements tensor parallelism by splitting individual layer weights, running each GPU's share of the computation locally, and synchronizing the results. Because the model's memory footprint is split across devices, memory savings scale with the number of GPUs and speedups are close to linear. The library also supports ZeRO-3 sharding for trainable parameters not covered by tensor parallelism, further reducing memory usage during training.
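
A minimal inference sketch following the usage pattern described in the project README; the model name and device list below are placeholders, and exact keyword arguments may differ between versions:

    import torch
    import transformers
    import tensor_parallel as tp

    # Placeholder model; any Hugging Face transformers causal LM should work.
    tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
    model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")

    # Single call that splits each layer's weights across the listed GPUs.
    model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

    inputs = tokenizer("A cat sat on a mat", return_tensors="pt")["input_ids"].to("cuda:0")
    with torch.no_grad():
        outputs = model.generate(inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))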

Quick Start & Requirements

  • Install: pip install tensor_parallel
  • Requirements: PyTorch, transformers. Multi-GPU setup recommended.
  • Demo: Kaggle notebook available for a 40B LLM.
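
Because the wrapped model behaves like a regular nn.Module, training uses the usual forward/backward/step loop. A hedged sketch (the sharded=True flag enabling ZeRO-3 for parameters not split by tensor parallelism is an assumption based on the README's description; the model and optimizer are placeholders):

    import torch
    import transformers
    import tensor_parallel as tp

    model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
    # Assumption: sharded=True applies ZeRO-3 sharding to parameters that
    # tensor parallelism leaves intact; consult the README for the exact flag.
    model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"], sharded=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    input_ids = torch.randint(0, 50257, (2, 128)).to("cuda:0")
    outputs = model(input_ids=input_ids, labels=input_ids)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()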

Highlighted Details

  • Enables running large PyTorch models on multiple GPUs with a single line of code.
  • Supports both training and inference.
  • Offers memory-efficient dispatch for loading models using convert_state_dict and accelerate.
  • Includes a save_tensor_parallel context manager for saving models back to a non-parallel format (see the sketch after this list).
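
A hedged sketch of exporting a tensor-parallel model back to an ordinary checkpoint with the save_tensor_parallel context manager (here `model` is assumed to be a module already wrapped by tp.tensor_parallel):

    import torch
    import tensor_parallel as tp

    # Inside the context, state_dict() is assumed to return weights in the
    # original, non-parallel layout, so the file loads without tensor_parallel.
    with tp.save_tensor_parallel(model):
        torch.save(model.state_dict(), "model.pt")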

Maintenance & Community

The project is maintained by BlackSamorez, though recent activity has been low (see the Health Check below). Users can report bugs and issues via the GitHub issue tracker.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library is primarily designed for quick prototyping on a single machine. For large-scale, multi-node training, more complex solutions like DeepSpeed or Megatron-LM are recommended. Debugging NCCL errors may require setting TENSOR_PARALLEL_USE_NATIVE=1.
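
One hedged way to enable that fallback when diagnosing NCCL problems is to set the environment variable before tensor_parallel is imported (the behavior is assumed from the caveat above; setting it in the shell before launching the script works as well):

    import os

    # Assumption: this selects tensor_parallel's native (non-NCCL) backend,
    # which is slower but easier to debug when NCCL crashes or hangs.
    os.environ["TENSOR_PARALLEL_USE_NATIVE"] = "1"

    import tensor_parallel as tp  # import after setting the variable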

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Amanpreet Singh (Cofounder of Contextual AI) and Ross Taylor (Cofounder of General Reasoning; Co-creator of Papers with Code).

torchshard by kaiyuyue

300 stars
PyTorch engine for tensor slicing into parallel shards
Created 4 years ago
Updated 3 months ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

6k stars
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago