PixArt-alpha by PixArt-alpha

Fast text-to-image synthesis with Diffusion Transformers

created 1 year ago
3,141 stars

Top 15.7% on sourcepulse

View on GitHub
Project Summary

PixArt-α is a PyTorch implementation of a Diffusion Transformer (DiT) model for photorealistic text-to-image synthesis, designed for significantly faster training and competitive generation quality compared to existing large-scale models. It targets researchers and developers in the AI-generated content (AIGC) community seeking to build high-quality, low-cost generative models.

How It Works

PixArt-α employs a Transformer architecture for diffusion models, incorporating cross-attention to inject text conditioning efficiently. Its training strategy is decomposed into three distinct steps: optimizing pixel dependency, text-image alignment, and image aesthetic quality. The model leverages highly informative data, specifically dense pseudo-captions generated by a large vision-language model, to enhance text-image alignment. This approach results in a 0.6B-parameter model that achieves competitive FID scores with substantially reduced training time and cost.
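The cross-attention conditioning described above can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration (module and dimension names are assumptions, not the repo's actual code): image tokens supply the queries while text embeddings supply the keys and values, so text information is injected into each Transformer block.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Hypothetical sketch of a DiT block's cross-attention conditioning step."""
    def __init__(self, dim=64, text_dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # queries come from image tokens; keys/values come from text embeddings
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, x, text_emb):
        out, _ = self.attn(self.norm(x), text_emb, text_emb)
        return x + out  # residual connection, as in standard Transformer blocks

x = torch.randn(2, 16, 64)  # (batch, image tokens, dim)
t = torch.randn(2, 8, 64)   # (batch, text tokens, text_dim)
y = CrossAttentionBlock()(x, t)
print(y.shape)
```

The output keeps the image-token shape `(2, 16, 64)`; only the content of the tokens is modulated by the text.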

Quick Start & Requirements

  • Installation: Clone the repository and install requirements:
    git clone https://github.com/PixArt-alpha/PixArt-alpha.git
    cd PixArt-alpha
    pip install -r requirements.txt
    
  • Prerequisites: Python >= 3.9, PyTorch >= 1.13.0 (built with CUDA 11.7). CUDA is required for GPU acceleration.
  • Models: Pre-trained weights are available for download.
  • Resources: Inference requires at least 23GB VRAM with this repo, but the diffusers integration supports as low as 8GB.
  • Demos: Gradio app available (python app/app.py), Docker support, and Hugging Face/Google Colab demos are provided.
  • Documentation: Detailed inference speed and code guidance are available in docs.

Highlighted Details

  • Achieves image quality competitive with models like Imagen and SDXL while using roughly 11% of the training compute of Stable Diffusion v1.5 (675 vs. 6,250 A100 GPU days).
  • Supports high-resolution image synthesis up to 1024px.
  • PixArt-δ variant offers fast inference (0.5s on A100) and low VRAM usage (<8GB) with LCM integration.
  • Includes support for ControlNet and Dreambooth fine-tuning.

Maintenance & Community

The project is actively developed with recent updates including PixArt-δ (LCM and ControlNet) releases and diffusers integration. A Discord community is available for discussions and contributions.

Licensing & Compatibility

The repository's primary license is not explicitly stated in the README. However, the integration with Hugging Face diffusers suggests compatibility with its ecosystem. Specific model weights may have different licenses.

Limitations & Caveats

The base repository's inference requires significant GPU memory (23GB+). While the diffusers integration addresses lower VRAM requirements (8GB), users should verify specific model compatibility. The project also includes experimental features and ongoing development for newer versions like PixArt-Σ.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 87 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), and 12 more.

stablediffusion by Stability-AI — Latent diffusion model for high-resolution image synthesis (41k stars, top 0.1%; created 2 years ago, updated 1 month ago)