xDiT by xdit-project

Inference engine for parallel Diffusion Transformer (DiT) deployment

created 1 year ago
2,161 stars

Top 21.3% on sourcepulse

View on GitHub
Project Summary

xDiT is a scalable inference engine designed to accelerate Diffusion Transformer (DiT) models for image and video generation. It addresses the quadratic complexity of attention mechanisms in DiTs, enabling efficient deployment across multiple GPUs and machines for real-time applications. The engine targets researchers and developers working with large-scale DiT models, offering significant performance gains through advanced parallelism and single-GPU acceleration techniques.

How It Works

xDiT employs a hybrid parallelism strategy, combining techniques like Unified Sequence Parallelism (USP), PipeFusion (sequence-level pipeline parallelism), CFG Parallel, and Data Parallel. USP is a novel approach that unifies DeepSpeed-Ulysses and Ring-Attention for efficient sequence parallelism. PipeFusion leverages temporal redundancy in diffusion models for pipeline parallelism. These methods can be hybridized, with the product of parallel degrees matching the total number of devices. Additionally, xDiT incorporates single-GPU acceleration through kernel optimizations, compilation acceleration (torch.compile, onediff), and cache acceleration (TeaCache, First-Block-Cache, DiTFastAttn) to exploit computational redundancies.
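
To make the degree arithmetic concrete, here is a minimal launch sketch assuming a single 8-GPU node split as 2-way Ulysses x 2-way PipeFusion x 2-way CFG parallelism (2 * 2 * 2 == 8 devices). The example script name is a placeholder and the flag spellings are assumptions modeled on the degree names used here; check the scripts in ./examples/ and the project docs for the exact interface.

    # Hypothetical launch on one 8-GPU node; the product of the parallel degrees
    # (2 Ulysses x 2 PipeFusion x 2 CFG) must equal the number of devices (8).
    # examples/your_dit_example.py is a placeholder for one of the bundled example scripts.
    torchrun --nproc_per_node=8 examples/your_dit_example.py \
        --ulysses_degree 2 \
        --pipefusion_parallel_degree 2 \
        --use_cfg_parallel \
        --prompt "an astronaut riding a horse on the moon"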

Quick Start & Requirements

  • Installation: pip install xfuser, or pip install "xfuser[diffusers,flash-attn]" for the optional dependencies. Install from source with pip install -e . or pip install -e ".[diffusers,flash-attn]". Docker image available: thufeifeibear/xdit-dev. A combined install-and-run sketch follows this list.
  • Prerequisites: flash-attn (>= 2.6.0 recommended for optimal GPU performance, fallback available for NPU compatibility). diffusers is optional but recommended for many models.
  • Usage: Examples provided in ./examples/. Run with bash examples/run.sh. Hybrid parallelism requires careful configuration of degrees (e.g., ulysses_degree * pipefusion_parallel_degree * cfg_degree == num_devices).
  • Links: Papers, Quick Start, Supported DiTs, Dev Guide, Discussion.
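
Putting the items above together, a minimal end-to-end quick start looks roughly like this; the extras and the example runner are taken directly from the installation and usage notes, while the model choice and parallel degrees come from the script being run.

    # Install xfuser with the optional diffusers and flash-attn extras,
    # then run the bundled example launcher.
    pip install "xfuser[diffusers,flash-attn]"
    bash examples/run.sh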

Highlighted Details

  • Supports a wide range of DiT models including StepVideo, HunyuanVideo, PixArt-Sigma, and Stable Diffusion 3.
  • Pioneers USP and PipeFusion for efficient sequence and pipeline parallelism, respectively.
  • Offers hybrid parallelism to combine multiple strategies for optimal scaling.
  • Includes single-GPU acceleration via compilation (torch.compile, onediff) and cache methods.

Maintenance & Community

  • Active development with a recent major API upgrade in August 2024.
  • Community Discord server available: https://discord.gg/YEWzWfCF9S.
  • Actively seeking contributions for new features and models.

Licensing & Compatibility

  • The primary license is not explicitly stated in the README. The project cites multiple research papers, indicating a research-oriented focus; compatibility for commercial use or closed-source linking is not documented.

Limitations & Caveats

  • Legacy APIs are outdated and do not support hybrid parallelism; users are strongly encouraged to use the new APIs.
  • Cache acceleration methods are currently supported only for the FLUX model with USP, not with PipeFusion.
  • Specific diffusers versions may be required for certain models, necessitating potential version management.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 8

Star History

  • 272 stars in the last 90 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309 stars
Framework for large-scale transformer optimization
created 3 years ago
updated 2 years ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Julien Chaumond (Cofounder of Hugging Face), and 1 more.

parallelformers by tunib-ai

0%
790 stars
Toolkit for easy model parallelization
created 4 years ago
updated 2 years ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0%
402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 6 more.

gpt-neox by EleutherAI

0.1%
7k stars
Framework for training large-scale autoregressive language models
created 4 years ago
updated 1 week ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2%
40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago
updated 1 day ago