xla by pytorch

PyTorch on XLA devices

Created 6 years ago
2,677 stars

Top 17.6% on SourcePulse

View on GitHub
Project Summary

This repository provides PyTorch/XLA, a Python package enabling PyTorch to run on XLA-accelerated hardware, primarily Google Cloud TPUs and NVIDIA GPUs. It targets researchers and engineers looking to leverage specialized hardware for faster deep learning model training and inference, offering significant performance gains over standard CPU or GPU setups.

How It Works

PyTorch/XLA integrates PyTorch with the XLA (Accelerated Linear Algebra) compiler, which fuses and compiles PyTorch operations into efficient kernels for specific hardware backends. The library supports several execution modes, including single-process, multi-process, and SPMD (Single Program, Multiple Data), allowing flexible scaling across multiple accelerators. It employs lazy tensor evaluation and asynchronous execution to maximize hardware utilization: operations are recorded into a graph and only compiled and executed when results are actually needed.
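The lazy-evaluation idea can be illustrated with a toy sketch in pure Python. Note this is only an analogy for how PyTorch/XLA traces operations into a graph before materializing results; `LazyTensor`, `constant`, and `materialize` are invented names for illustration, not part of the torch_xla API:

```python
# Toy illustration of lazy evaluation: operations build an expression
# graph, and nothing is computed until a result is requested. This is
# loosely analogous to how PyTorch/XLA records ops for the XLA compiler
# and materializes tensors at a step boundary (names are invented).

class LazyTensor:
    def __init__(self, op, args=(), value=None):
        self.op, self.args, self.value = op, args, value

    @staticmethod
    def constant(v):
        # Leaf node holding a concrete value.
        return LazyTensor("const", value=v)

    def __add__(self, other):
        return LazyTensor("add", (self, other))

    def __mul__(self, other):
        return LazyTensor("mul", (self, other))

    def materialize(self):
        # Executed only on demand, so the whole graph is visible to an
        # optimizer/compiler before any work happens.
        if self.op == "const":
            return self.value
        lhs, rhs = (a.materialize() for a in self.args)
        return lhs + rhs if self.op == "add" else lhs * rhs

a = LazyTensor.constant(2)
b = LazyTensor.constant(3)
c = (a + b) * a          # builds a graph; no arithmetic has run yet
print(c.materialize())   # prints 10
```

Because evaluation is deferred, an entire training step can be compiled as one unit, which is what lets XLA fuse kernels aggressively.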

Quick Start & Requirements

  • Installation: Use pip install torch==<version> 'torch_xla[tpu]==<version>' for stable builds on TPU VMs. Nightly builds and specific CUDA versions require direct wheel installation from provided GCS URLs.
  • Prerequisites: a Google Cloud TPU VM or compatible GPU environment. GPU builds require specific CUDA versions (e.g., 12.1, 12.6). Supported Python versions (3.8-3.11) depend on the release.
  • Resources: Requires access to TPU or GPU hardware. Setup involves installing PyTorch and PyTorch/XLA wheels.
  • Documentation: PyTorch/XLA Docs
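For concreteness, a stable-build installation on a TPU VM might look like the following. The version number is an illustrative placeholder (the README's install command uses `<version>`); substitute a supported release and check the release notes for matching torch/torch_xla wheels:

```shell
# Illustrative only: install matching PyTorch and PyTorch/XLA stable
# wheels with the TPU extra on a Cloud TPU VM. "2.7.0" is a placeholder
# version; pick the release you actually need.
pip install torch==2.7.0 'torch_xla[tpu]==2.7.0'
```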

Highlighted Details

  • Offers C++11 ABI builds for improved lazy tensor tracing performance, showing up to 39% MFU on Mixtral 8x7B compared to 33% for pre-C++11 ABI.
  • Supports distributed training paradigms like DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP).
  • Provides comprehensive documentation on performance tuning, distributed execution, and specific features like Pallas and Triton integration.
  • Includes reference implementations for large models in the AI-Hypercomputer/tpu-recipes repository.

Maintenance & Community

Jointly maintained by Google and Meta, with contributions from individual developers. Feedback and bug reports are encouraged via GitHub issues.

Licensing & Compatibility

The repository is open source. The README does not state the license explicitly, but the project is generally aligned with PyTorch's permissive licensing, which allows commercial use.

Limitations & Caveats

The README notes that as of release 2.7, only C++11 ABI builds are provided, which may impact compatibility with older pre-C++11 ABI setups. Specific Python and CUDA version compatibility must be carefully checked when selecting wheels or Docker images.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 73
  • Issues (30d): 18
  • Star History: 27 stars in the last 30 days

Explore Similar Projects

Starred by Chris Lattner (author of LLVM, Clang, Swift, Mojo, MLIR; cofounder of Modular), Vincent Weisser (cofounder of Prime Intellect), and 18 more.

open-infra-index by deepseek-ai

Top 0.1% · 8k stars
AI infrastructure tools for efficient AGI development
Created 6 months ago
Updated 4 months ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Luis Capelo (cofounder of Lightning AI), Alex Yu (research scientist at OpenAI; former cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

Top 0.4% · 3k stars
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago
Updated 19 hours ago
Starred by Clement Delangue (cofounder of Hugging Face), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

Top 0.3% · 9k stars
PyTorch training helper for distributed execution
Created 4 years ago
Updated 1 day ago