torchshard  by kaiyuyue

PyTorch engine for tensor slicing into parallel shards

created 4 years ago
299 stars

Top 90.0% on sourcepulse

Project Summary

TorchShard is a PyTorch extension designed to enable efficient training of large neural networks by sharding tensors across multiple GPUs. It targets researchers and engineers working with models that have massive linear layers or a very large number of classes, offering a way to reduce GPU memory consumption and scale training.

How It Works

TorchShard implements tensor parallelism by slicing PyTorch tensors along specified dimensions. It provides ts.nn.ParallelLinear as a drop-in replacement for torch.nn.Linear, along with parallel loss functions, and integrates with PyTorch's distributed primitives. Linear layers and loss functions are computed in parallel, which distributes the memory and compute load across the available GPUs. The API is designed to be consistent with PyTorch, minimizing the learning curve for users.
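To make the slicing idea concrete, here is a minimal sketch in plain Python. It is not TorchShard's implementation or API: it assumes an nn.Linear-style weight of shape (out_features, in_features) split along dimension 0, each shard stands in for one GPU, and the helper names (linear, split_evenly, sharded_linear) are hypothetical. In a real multi-GPU setup the final concatenation would be a collective gather across processes.

```python
def linear(x, weight, bias):
    """Compute y = x @ weight.T + bias with weight shaped (out_features, in_features)."""
    return [
        [sum(xi * wi for xi, wi in zip(sample, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for sample in x
    ]


def split_evenly(items, num_shards):
    """Split a list into num_shards contiguous chunks of equal size."""
    if len(items) % num_shards != 0:
        raise ValueError("length must be divisible by num_shards")
    size = len(items) // num_shards
    return [items[i * size:(i + 1) * size] for i in range(num_shards)]


def sharded_linear(x, weight, bias, num_shards):
    """Simulate a linear layer whose output features are split across shards."""
    # Each (hypothetical) device holds only its slice of the weight and bias.
    weight_shards = split_evenly(weight, num_shards)
    bias_shards = split_evenly(bias, num_shards)

    # Every shard sees the full input and produces a slice of the output features.
    partial_outputs = [linear(x, w, b) for w, b in zip(weight_shards, bias_shards)]

    # Stand-in for a gather: concatenate the slices along the feature dimension.
    gathered = []
    for sample_index in range(len(x)):
        row = []
        for part in partial_outputs:
            row.extend(part[sample_index])
        gathered.append(row)
    return gathered


x = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
bias = [0.0, 1.0, 2.0, 3.0]
full_output = linear(x, weight, bias)
sharded_output = sharded_linear(x, weight, bias, num_shards=2)
```

Each shard stores only its fraction of the weight matrix, which is where the memory saving comes from; the sharded result matches the unsharded one because every output feature depends on a single weight row.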

Quick Start & Requirements

  • Primary install: pip install torchshard
  • Prerequisites: PyTorch and an initialized distributed environment (e.g., via torch.distributed.init_process_group).
  • Links: Documents, INSTALL.md

Highlighted Details

  • Enables scaling models with millions of classes or massive linear layers.
  • Offers parallel implementations for nn.Linear and loss functions.
  • Supports sharding along row (dim=0) or column (dim=1) dimensions.
  • Provides utilities for collecting sharded model states.
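For the many-classes case, one common way to compute a softmax cross-entropy over class-sharded logits can be sketched in plain Python as follows. This illustrates the general technique rather than TorchShard's code: each inner list stands in for the logits held by one GPU, the max and sum steps stand in for all-reduce collectives, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Reference cross-entropy for one sample over the full list of class logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def sharded_cross_entropy(logit_shards, target):
    """Cross-entropy where the class dimension is split across shards."""
    num_classes = sum(len(shard) for shard in logit_shards)
    if not 0 <= target < num_classes:
        raise ValueError("target is out of range")

    # Step 1: local max per shard, then a global max (stand-in for all-reduce MAX).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: local sum of exponentials per shard, then a global sum
    # (stand-in for all-reduce SUM).
    total = sum(
        sum(math.exp(v - global_max) for v in shard) for shard in logit_shards
    )

    # Step 3: only the shard that owns the target class contributes its logit.
    offset = 0
    target_logit = 0.0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)

    return global_max + math.log(total) - target_logit


logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5]
shards = [logits[0:2], logits[2:4], logits[4:6]]
loss_full = cross_entropy(logits, target=3)
loss_sharded = sharded_cross_entropy(shards, target=3)
```

Only two scalars per sample (a max and a sum) need to cross shard boundaries, so no shard ever has to hold the full vector of class logits.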

Maintenance & Community

The project is primarily maintained by Kaiyu Yue. Contributions are welcome via pull requests, and a contact email is provided for inquiries.

Licensing & Compatibility

The README does not explicitly state a license, so the licensing terms should be verified before commercial use or closed-source integration.

Limitations & Caveats

The README does not specify compatibility with older PyTorch versions or other deep learning frameworks. The reported performance figures were measured on specific hardware (NVIDIA TITAN-XP) and may differ on other GPU architectures.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
