art-msra by microsoft

Anonymous Region Transformer for multi-layer image generation

Created 1 year ago

362 stars

Top 77.6% on SourcePulse

Project Summary

This repository provides official code for ART (Anonymous Region Transformer), a method for generating multi-layer transparent images from a single global text prompt and an anonymous region layout. It targets researchers and artists interested in complex image composition and transparent media generation, offering a novel approach to layer-based image synthesis.

How It Works

ART uses a transformer architecture to process an anonymous region layout (bounding boxes without per-layer descriptions) together with a global text prompt. Because no per-layer captions are needed, preparing the input is simpler. The design targets efficiency: it is described as more efficient than full attention and spatial-temporal attention approaches, and it supports generating more than 50 layers.
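
To make the input format concrete, here is a minimal sketch of what "one global prompt plus an anonymous region layout" could look like as data. The names `Box`, `LayoutRequest`, and `validate` are hypothetical and are not taken from the ART codebase; the only point is that boxes carry coordinates and no per-layer text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical names for illustration only; not the ART repository's API.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class LayoutRequest:
    prompt: str               # one global caption for the whole image
    canvas: Tuple[int, int]   # (width, height)
    boxes: List[Box]          # anonymous regions: no per-layer text attached

    def validate(self) -> None:
        """Raise ValueError if any box is empty or leaves the canvas."""
        width, height = self.canvas
        for i, (x0, y0, x1, y1) in enumerate(self.boxes):
            if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
                raise ValueError(f"box {i} is empty or outside the canvas")


request = LayoutRequest(
    prompt="A birthday card with balloons and a cake",
    canvas=(512, 512),
    boxes=[(0, 0, 512, 512), (64, 32, 256, 288), (192, 256, 448, 480)],
)
request.validate()
print(len(request.boxes), "anonymous regions, 1 prompt")
```

Each box would correspond to one output layer, so the number of layers follows from the layout rather than from a list of captions.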

Quick Start & Requirements

Multi-Layer Generation:
- Install dependencies via pip3 install torch==2.4.0 torchvision==0.19.0 diffusers==0.31.0 transformers==4.44.0 accelerate==0.34.2 peft==0.12.0 datasets==2.20.0 wandb==0.17.7 einops==0.8.0 sentencepiece==0.2.0 mmengine==0.10.4 prodigyopt==1.0.
- Requires PyTorch 2.4.0, Python 3.10, and a Hugging Face CLI login.
- Download the required checkpoints from the provided Google Drive links.
- Run inference with python example.py, or with python multi_layer_gen/test.py and its specified arguments.
- Official quick-start and testing scripts are available.
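
Since the install step pins exact versions, a quick check of the active environment can catch mismatches before running inference. The sketch below is not part of the repository; `PINS` is copied from the install command above, and `find_mismatches` is a hypothetical helper built on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the install command in this section.
PINS = {
    "torch": "2.4.0",
    "torchvision": "0.19.0",
    "diffusers": "0.31.0",
    "transformers": "4.44.0",
    "accelerate": "0.34.2",
    "peft": "0.12.0",
    "datasets": "2.20.0",
    "wandb": "0.17.7",
    "einops": "0.8.0",
    "sentencepiece": "0.2.0",
    "mmengine": "0.10.4",
    "prodigyopt": "1.0",
}


def find_mismatches(pins, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # A local build tag such as "2.4.0+cu121" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version; it says nothing about the Python version or the Hugging Face login, which still need to be checked separately.
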
LLM For Layout Planning:
- Requires Python 3.10.
- Install dependencies from within the layout_planner directory via pip install -r requirements_part1.txt followed by pip install -r requirements_part2.txt.
- May require ffmpeg, libsm6, and libxext6, and potentially flash-attn-2.
- Inference requires downloading the base Llama 3 8B model and a layout planner checkpoint.
- Configuration is done via scripts/inference_template.sh.
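
The system packages listed above sit outside pip, so they are easy to overlook. As a rough sketch (not from the repository), the hypothetical helper `missing_system_deps` below looks for the ffmpeg executable and for the shared libraries that the Debian/Ubuntu packages libsm6 and libxext6 provide. It relies on `ctypes.util.find_library`, which may not see libraries in unusual locations, so treat a "missing" result as a hint rather than proof.

```python
import shutil
from ctypes.util import find_library


def missing_system_deps(which=shutil.which, find_lib=find_library):
    """Return the names of system dependencies that could not be located."""
    missing = []
    if which("ffmpeg") is None:
        missing.append("ffmpeg")
    # On Debian/Ubuntu, libsm6 provides libSM and libxext6 provides libXext.
    for package, lib in (("libsm6", "SM"), ("libxext6", "Xext")):
        if find_lib(lib) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    absent = missing_system_deps()
    print("missing:", ", ".join(absent) if absent else "none")
```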

Highlighted Details

- Supports generation of 50+ image layers from a single prompt.
- Eliminates the need for per-layer captions by using anonymous region layouts.
- Is more efficient than full attention and spatial-temporal attention approaches.
- Includes an LLM-based module for automatic layout planning.

Maintenance & Community

The repository is associated with CVPR 2025.
The training code is currently listed as TODO.

Licensing & Compatibility

The license is not explicitly stated in the README.

Limitations & Caveats

Training code is not yet released.
The license is not specified, which may impact commercial use.

Health Check

Last Commit: 5 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

8 stars in the last 30 days
