art-msra by microsoft

Anonymous Region Transformer for multi-layer image generation

Created 9 months ago
341 stars

Top 81.0% on SourcePulse

View on GitHub
Project Summary

This repository provides official code for ART (Anonymous Region Transformer), a method for generating multi-layer transparent images from a single global text prompt and an anonymous region layout. It targets researchers and artists interested in complex image composition and transparent media generation, offering a novel approach to layer-based image synthesis.

How It Works

ART uses a transformer architecture that conditions generation on an anonymous region layout (bounding boxes without explicit per-layer descriptions) together with a single global text prompt, which removes the need to write a caption for every layer. The design is more efficient than full attention and spatial-temporal attention mechanisms and supports generating more than 50 layers.
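
Concretely, the inputs are minimal: one global caption and a set of unlabeled boxes. The sketch below shows what such an anonymous region layout might look like; the variable names and the [x1, y1, x2, y2] coordinate convention are assumptions made for illustration, not the repository's actual interface (see example.py and multi_layer_gen/test.py for that).

    # Hypothetical illustration of ART's inputs: a single global prompt plus
    # anonymous regions, i.e. bounding boxes with no per-layer captions.
    # Names and the [x1, y1, x2, y2] pixel convention are assumed for clarity.
    PROMPT="a watercolor poster of a mountain village at dusk"
    REGIONS="0,0,1024,1024 96,512,480,992 640,128,1008,448"   # three anonymous boxes
    # Each box becomes one transparent layer; the single global prompt describes
    # the whole composition, so no box needs its own text description.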

Quick Start & Requirements

  • Multi-Layer Generation:
    • Install dependencies via pip3 install torch==2.4.0 torchvision==0.19.0 diffusers==0.31.0 transformers==4.44.0 accelerate==0.34.2 peft==0.12.0 datasets==2.20.0 wandb==0.17.7 einops==0.8.0 sentencepiece==0.2.0 mmengine==0.10.4 prodigyopt==1.0.
    • Requires PyTorch 2.4.0, Python 3.10, and Hugging Face CLI login.
    • Download multiple checkpoints from provided Google Drive links.
    • Run inference with python example.py, or with python multi_layer_gen/test.py and the appropriate arguments.
    • Official quick-start and testing scripts are available; a consolidated sketch of both workflows follows this list.
  • LLM For Layout Planning:
    • Requires Python 3.10.
    • Install dependencies from within the layout_planner directory via pip install -r requirements_part1.txt and pip install -r requirements_part2.txt.
    • May require the system packages ffmpeg, libsm6, and libxext6, and potentially flash-attn-2.
    • Inference requires downloading the Llama 3 8B base model and a layout planner checkpoint.
    • Configuration is done via scripts/inference_template.sh.
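
A consolidated sketch of both quick-start paths, assembled from the steps above. The checkpoint placement and the direct bash invocation of the inference template are assumptions; the Google Drive and model downloads are manual steps described in the README.

    # --- Multi-layer generation (Python 3.10, PyTorch 2.4.0) ---
    pip3 install torch==2.4.0 torchvision==0.19.0 diffusers==0.31.0 \
      transformers==4.44.0 accelerate==0.34.2 peft==0.12.0 datasets==2.20.0 \
      wandb==0.17.7 einops==0.8.0 sentencepiece==0.2.0 mmengine==0.10.4 \
      prodigyopt==1.0
    huggingface-cli login                    # the README requires a Hugging Face CLI login
    # Manually download the released checkpoints from the Google Drive links
    # in the README and place them where the scripts expect them.
    python example.py                        # quick-start inference
    # python multi_layer_gen/test.py ...     # full test script; arguments are given in the README

    # --- LLM layout planner (Python 3.10) ---
    cd layout_planner
    pip install -r requirements_part1.txt
    pip install -r requirements_part2.txt
    # System packages that may be needed: ffmpeg, libsm6, libxext6 (and flash-attn-2).
    # Download the Llama 3 8B base model and the layout planner checkpoint,
    # then edit the inference template and run it:
    bash scripts/inference_template.sh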

Highlighted Details

  • Supports generation of 50+ image layers from a single prompt.
  • Eliminates the need for per-layer captions by using anonymous region layouts.
  • More efficient than full attention and spatial-temporal attention baselines.
  • Includes an LLM-based module for automatic layout planning.

Maintenance & Community

  • The repository accompanies a CVPR 2025 paper.
  • The training code is currently listed as TODO.

Licensing & Compatibility

  • The license is not explicitly stated in the README.

Limitations & Caveats

  • Training code is not yet released.
  • The license is not specified, which may impact commercial use.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days
