art-msra  by microsoft

Anonymous Region Transformer for multi-layer image generation

Created 1 year ago
365 stars

Top 77.1% on SourcePulse

Project Summary

This repository provides official code for ART (Anonymous Region Transformer), a method for generating multi-layer transparent images from a single global text prompt and an anonymous region layout. It targets researchers and artists interested in complex image composition and transparent media generation, offering a novel approach to layer-based image synthesis.

How It Works

ART uses a transformer to process a global text prompt together with an anonymous region layout, that is, a set of bounding boxes with no per-layer descriptions. Because no per-layer captions are needed, the input is simpler to prepare. The method is described as more efficient than full attention and spatial-temporal attention mechanisms, and it supports generating more than 50 layers.
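To make the input format concrete, here is a minimal Python sketch of what an anonymous region layout could look like as data: one global prompt plus a list of unlabeled boxes. The class name `AnonymousLayout`, its fields, and the `(x0, y0, x1, y1)` pixel convention are assumptions for illustration only; they are not taken from the repository's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1) in pixels, top-left origin.
Box = Tuple[int, int, int, int]


@dataclass
class AnonymousLayout:
    """Hypothetical container: one global prompt plus unlabeled boxes."""

    prompt: str                   # the single global text prompt
    canvas_size: Tuple[int, int]  # (width, height)
    boxes: List[Box]              # no per-layer captions attached

    def validate(self) -> None:
        """Raise ValueError if any box is empty or leaves the canvas."""
        width, height = self.canvas_size
        for i, (x0, y0, x1, y1) in enumerate(self.boxes):
            if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
                raise ValueError(f"box {i} is empty or outside the canvas")

    def coverage(self) -> List[float]:
        """Fraction of the canvas area covered by each box."""
        width, height = self.canvas_size
        return [
            ((x1 - x0) * (y1 - y0)) / (width * height)
            for x0, y0, x1, y1 in self.boxes
        ]


layout = AnonymousLayout(
    prompt="A birthday card with balloons and a cake",
    canvas_size=(512, 512),
    boxes=[(0, 0, 512, 512), (64, 64, 320, 320), (256, 128, 512, 384)],
)
layout.validate()
```

Note that nothing in the layout says which box holds the balloons and which holds the cake; in the approach described above, that assignment is left to the model.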

Quick Start & Requirements

  • Multi-Layer Generation:
    • Install the pinned dependencies: pip3 install torch==2.4.0 torchvision==0.19.0 diffusers==0.31.0 transformers==4.44.0 accelerate==0.34.2 peft==0.12.0 datasets==2.20.0 wandb==0.17.7 einops==0.8.0 sentencepiece==0.2.0 mmengine==0.10.4 prodigyopt==1.0
    • Requires PyTorch 2.4.0, Python 3.10, and Hugging Face CLI login.
    • Download multiple checkpoints from provided Google Drive links.
    • Run inference with python example.py or python multi_layer_gen/test.py with specified arguments.
    • Official quick-start and testing scripts are available.
  • LLM For Layout Planning:
    • Requires Python 3.10.
    • Install dependencies from within the layout_planner directory: pip install -r requirements_part1.txt, then pip install -r requirements_part2.txt.
    • May require ffmpeg, libsm6, libxext6 and potentially flash-attn-2.
    • Inference requires downloading base Llama 3 8B and a layout planner checkpoint.
    • Configuration is done via scripts/inference_template.sh.
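Because the Multi-Layer Generation step pins exact package versions, it can help to confirm that the active environment matches them before running inference. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` is hypothetical, and treating a local build tag such as `+cu121` as a match is an assumption, not something the repository specifies.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip3 install command listed above.
PINNED = {
    "torch": "2.4.0",
    "torchvision": "0.19.0",
    "diffusers": "0.31.0",
    "transformers": "4.44.0",
    "accelerate": "0.34.2",
    "peft": "0.12.0",
    "datasets": "2.20.0",
    "wandb": "0.17.7",
    "einops": "0.8.0",
    "sentencepiece": "0.2.0",
    "mmengine": "0.10.4",
    "prodigyopt": "1.0",
}


def check_pins(pins, get_version=version):
    """Return {package: problem} for every pin the environment misses."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        # Assumption: a local build tag such as "2.4.0+cu121" still matches.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, want {wanted}"
    return problems


if __name__ == "__main__":
    for name, message in check_pins(PINNED).items():
        print(f"{name}: {message}")
```

An empty result means every pinned package is present at the expected version. The `get_version` parameter exists only so the check can be exercised without installing anything.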

Highlighted Details

  • Supports generation of 50+ image layers from a single prompt.
  • Eliminates the need for per-layer captions by using anonymous region layouts.
  • Described as more efficient than full attention and spatial-temporal attention mechanisms.
  • Includes an LLM-based module for automatic layout planning.

Maintenance & Community

  • The repository is associated with CVPR 2025.
  • The training code is currently listed as TODO.

Licensing & Compatibility

  • The license is not explicitly stated in the README.

Limitations & Caveats

  • Training code is not yet released.
  • The license is not specified, which may impact commercial use.