PixArt-alpha by PixArt-alpha

Fast text-to-image synthesis with Diffusion Transformers

created 1 year ago
3,141 stars

Top 15.7% on sourcepulse

View on GitHub
Project Summary

PixArt-α is a PyTorch implementation of a Diffusion Transformer (DiT) model for photorealistic text-to-image synthesis, designed for significantly faster training and competitive generation quality compared to existing large-scale models. It targets researchers and developers in the AI-generated content (AIGC) community seeking to build high-quality, low-cost generative models.

How It Works

PixArt-α employs a Transformer architecture for diffusion models, incorporating cross-attention to inject text conditioning efficiently. Its training strategy is decomposed into three distinct steps: optimizing pixel dependency, text-image alignment, and image aesthetic quality. The model leverages highly informative data, specifically dense pseudo-captions generated by a large vision-language model, to enhance text-image alignment. This approach results in a 0.6B-parameter model that achieves competitive FID scores with substantially reduced training time and cost.
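The cross-attention conditioning described above can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration (module and dimension names are assumptions, not the repo's actual code): image tokens supply the queries while text embeddings supply the keys and values, so text information is injected into each Transformer block.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Hypothetical sketch of a DiT block's cross-attention conditioning step."""
    def __init__(self, dim=64, text_dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # queries come from image tokens; keys/values come from text embeddings
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, x, text_emb):
        out, _ = self.attn(self.norm(x), text_emb, text_emb)
        return x + out  # residual connection, as in standard Transformer blocks

x = torch.randn(2, 16, 64)  # (batch, image tokens, dim)
t = torch.randn(2, 8, 64)   # (batch, text tokens, text_dim)
y = CrossAttentionBlock()(x, t)
print(y.shape)
```

The output keeps the image-token shape `(2, 16, 64)`; only the content of the tokens is modulated by the text.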

Quick Start & Requirements

  • Installation: Clone the repository and install requirements:
    git clone https://github.com/PixArt-alpha/PixArt-alpha.git
    cd PixArt-alpha
    pip install -r requirements.txt
    
  • Prerequisites: Python >= 3.9, PyTorch >= 1.13.0 (built with CUDA 11.7). CUDA is required for GPU acceleration.
  • Models: Pre-trained weights are available for download.
  • Resources: Inference requires at least 23GB VRAM with this repo, but the diffusers integration supports as low as 8GB.
  • Demos: Gradio app available (python app/app.py), Docker support, and Hugging Face/Google Colab demos are provided.
  • Documentation: Detailed inference speed and code guidance are available in docs.

Highlighted Details

  • Achieves image quality competitive with models like Imagen and SDXL while using roughly 11% of the training compute of Stable Diffusion v1.5 (675 vs. 6,250 A100 GPU days).
  • Supports high-resolution image synthesis up to 1024px.
  • PixArt-δ variant offers fast inference (0.5s on A100) and low VRAM usage (<8GB) with LCM integration.
  • Includes support for ControlNet and Dreambooth fine-tuning.

Maintenance & Community

The project is actively developed with recent updates including PixArt-δ (LCM and ControlNet) releases and diffusers integration. A Discord community is available for discussions and contributions.

Licensing & Compatibility

The repository's primary license is not explicitly stated in the README. However, the integration with Hugging Face diffusers suggests compatibility with its ecosystem. Specific model weights may have different licenses.

Limitations & Caveats

The base repository's inference requires significant GPU memory (23GB+). While the diffusers integration addresses lower VRAM requirements (8GB), users should verify specific model compatibility. The project also includes experimental features and ongoing development for newer versions like PixArt-Σ.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 87 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), and 12 more.

stablediffusion by Stability-AI — Latent diffusion model for high-resolution image synthesis (41k stars, top 0.1%; created 2 years ago, updated 1 month ago)