xDiT by xdit-project

Inference engine for parallel Diffusion Transformer (DiT) deployment

created 1 year ago
2,161 stars

Top 21.3% on sourcepulse

View on GitHub
Project Summary

xDiT is a scalable inference engine designed to accelerate Diffusion Transformer (DiT) models for image and video generation. It addresses the quadratic complexity of attention mechanisms in DiTs, enabling efficient deployment across multiple GPUs and machines for real-time applications. The engine targets researchers and developers working with large-scale DiT models, offering significant performance gains through advanced parallelism and single-GPU acceleration techniques.

How It Works

xDiT employs a hybrid parallelism strategy, combining techniques like Unified Sequence Parallelism (USP), PipeFusion (sequence-level pipeline parallelism), CFG Parallel, and Data Parallel. USP is a novel approach that unifies DeepSpeed-Ulysses and Ring-Attention for efficient sequence parallelism. PipeFusion leverages temporal redundancy in diffusion models for pipeline parallelism. These methods can be hybridized, with the product of parallel degrees matching the total number of devices. Additionally, xDiT incorporates single-GPU acceleration through kernel optimizations, compilation acceleration (torch.compile, onediff), and cache acceleration (TeaCache, First-Block-Cache, DiTFastAttn) to exploit computational redundancies.
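
To make the degree arithmetic concrete, here is a minimal launch sketch assuming a single 8-GPU node split as 2-way Ulysses x 2-way PipeFusion x 2-way CFG parallelism (2 * 2 * 2 == 8 devices). The example script name is a placeholder and the flag spellings are assumptions modeled on the degree names used here; check the scripts in ./examples/ and the project docs for the exact interface.

    # Hypothetical launch on one 8-GPU node; the product of the parallel degrees
    # (2 Ulysses x 2 PipeFusion x 2 CFG) must equal the number of devices (8).
    # examples/your_dit_example.py is a placeholder for one of the bundled example scripts.
    torchrun --nproc_per_node=8 examples/your_dit_example.py \
        --ulysses_degree 2 \
        --pipefusion_parallel_degree 2 \
        --use_cfg_parallel \
        --prompt "an astronaut riding a horse on the moon"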

Quick Start & Requirements

  • Installation: pip install xfuser, or pip install "xfuser[diffusers,flash-attn]" for the optional dependencies. Install from source with pip install -e . or pip install -e ".[diffusers,flash-attn]". Docker image available: thufeifeibear/xdit-dev. A combined install-and-run sketch follows this list.
  • Prerequisites: flash-attn (>= 2.6.0 recommended for optimal GPU performance, fallback available for NPU compatibility). diffusers is optional but recommended for many models.
  • Usage: Examples provided in ./examples/. Run with bash examples/run.sh. Hybrid parallelism requires careful configuration of degrees (e.g., ulysses_degree * pipefusion_parallel_degree * cfg_degree == num_devices).
  • Links: Papers, Quick Start, Supported DiTs, Dev Guide, Discussion.
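
Putting the items above together, a minimal end-to-end quick start looks roughly like this; the extras and the example runner are taken directly from the installation and usage notes, while the model choice and parallel degrees come from the script being run.

    # Install xfuser with the optional diffusers and flash-attn extras,
    # then run the bundled example launcher.
    pip install "xfuser[diffusers,flash-attn]"
    bash examples/run.sh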

Highlighted Details

  • Supports a wide range of DiT models including StepVideo, HunyuanVideo, PixArt-Sigma, and Stable Diffusion 3.
  • Pioneers USP and PipeFusion for efficient sequence and pipeline parallelism, respectively.
  • Offers hybrid parallelism to combine multiple strategies for optimal scaling.
  • Includes single-GPU acceleration via compilation (torch.compile, onediff) and cache methods.

Maintenance & Community

  • Active development with a recent major API upgrade in August 2024.
  • Community Discord server available: https://discord.gg/YEWzWfCF9S.
  • Actively seeking contributions for new features and models.

Licensing & Compatibility

  • The primary license is not explicitly stated in the README. The project cites multiple research papers, indicating a research-oriented focus; compatibility for commercial use or closed-source linking is not documented.

Limitations & Caveats

  • Legacy APIs are outdated and do not support hybrid parallelism; users are strongly encouraged to use the new APIs.
  • Cache acceleration methods are currently supported only for the FLUX model with USP, not with PipeFusion.
  • Specific diffusers versions may be required for certain models, necessitating potential version management.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 8

Star History

  • 272 stars in the last 90 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309 stars
Framework for large-scale transformer optimization
created 3 years ago
updated 2 years ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Julien Chaumond (Cofounder of Hugging Face), and 1 more.

parallelformers by tunib-ai

0%
790 stars
Toolkit for easy model parallelization
created 4 years ago
updated 2 years ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0%
402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 6 more.

gpt-neox by EleutherAI

0.1%
7k stars
Framework for training large-scale autoregressive language models
created 4 years ago
updated 1 week ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2%
40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago
updated 1 day ago