RPG-DiffusionMaster by YangLing0818

Training-free paradigm for text-to-image generation/editing

Created 2 years ago

1,840 stars

Top 23.3% on SourcePulse

2 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

This repository provides the official implementation for RPG, a training-free paradigm for advanced text-to-image generation and editing. It leverages multimodal large language models (MLLMs) for prompt recaptioning and regional planning, combined with regional diffusion techniques, to achieve state-of-the-art results, particularly for complex compositional prompts. The framework is designed for researchers and practitioners in AI image generation seeking enhanced control and fidelity.

How It Works

RPG integrates MLLMs (like GPT-4, Gemini-Pro, or local models such as miniGPT-4) to break down complex text prompts into regional descriptions and spatial layouts. This structured input is then fed into a complementary regional diffusion model, allowing for precise control over different image areas. This approach enables the generation of images with high resolution and intricate details, overcoming limitations of standard text-to-image models in handling complex spatial relationships and multiple object attributes.
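To make the planning step concrete, here is a minimal sketch of how a regional plan could be turned into pixel boxes before regional diffusion. The plan format (rows with a height weight, each holding columns with a width weight and a sub-prompt), the `Region` class, and the `plan_to_regions` function are all hypothetical and chosen for illustration; they are not the repository's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One planned image region (hypothetical structure, not RPG's own)."""
    prompt: str
    box: tuple  # (left, top, right, bottom) in pixels


def plan_to_regions(rows, width, height):
    """Convert a weighted row/column plan into pixel boxes.

    rows: list of (row_weight, [(col_weight, sub_prompt), ...]).
    Weights are relative; each row's columns span the full canvas width.
    """
    total_h = sum(weight for weight, _ in rows)
    regions = []
    top_acc = 0.0
    for row_weight, cols in rows:
        top = round(top_acc / total_h * height)
        top_acc += row_weight
        bottom = round(top_acc / total_h * height)
        total_w = sum(weight for weight, _ in cols)
        left_acc = 0.0
        for col_weight, prompt in cols:
            left = round(left_acc / total_w * width)
            left_acc += col_weight
            right = round(left_acc / total_w * width)
            regions.append(Region(prompt, (left, top, right, bottom)))
    return regions


# Assumed example plan: two subjects on top, a shared ground strip below.
plan = [
    (1, [(1, "a red fox"), (1, "a white owl")]),
    (1, [(1, "a snowy forest floor")]),
]
regions = plan_to_regions(plan, width=2048, height=1024)
```

Each resulting region pairs a sub-prompt with the area it should control, which is the kind of structured input a regional diffusion step consumes.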

Quick Start & Requirements

Installation: Clone the repository, create a conda environment (conda create -n RPG python==3.9), activate it (conda activate RPG), and install dependencies (pip install -r requirements.txt).
Prerequisites: Python 3.9+, PyTorch, and Hugging Face's diffusers library. An NVIDIA GPU with at least 10GB of VRAM is recommended. API-hosted MLLMs such as GPT-4 add no local VRAM load, while local MLLMs require additional VRAM.
Models: Download a diffusion backbone (SDXL, SDXL-Turbo, Playground v2, Civitai-hosted checkpoints such as AlbedoBase XL and DreamShaper XL, SD v1.5, or SD v2.1) and choose an MLLM: GPT-4 or Gemini-Pro through their APIs, or a local model (miniGPT-4, Llama2-13b-chat, Llama2-70b-chat).
Usage: Refer to RPG.py and example notebooks for detailed usage with GPT-4, local LLMs, and different diffusion pipelines (RegionalDiffusionPipeline for SD v1.x/v2.x, RegionalDiffusionXLPipeline for SDXL).
Links: Official Implementation, Hugging Face Spaces, Example Notebook
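The pipeline choice described under Usage can be summarized as a small lookup. This is an illustrative sketch only: the family keys (`"sd-v1"`, `"sd-v2"`, `"sdxl"`) and the helper `pipeline_name_for` are hypothetical, and the function returns class names as strings rather than importing anything from the repository.

```python
# Mapping taken from the Usage line: SD v1.x/v2.x use RegionalDiffusionPipeline,
# SDXL uses RegionalDiffusionXLPipeline. The keys are assumed labels.
PIPELINE_BY_FAMILY = {
    "sd-v1": "RegionalDiffusionPipeline",
    "sd-v2": "RegionalDiffusionPipeline",
    "sdxl": "RegionalDiffusionXLPipeline",
}


def pipeline_name_for(family: str) -> str:
    """Return the pipeline class name for a backbone family label."""
    key = family.strip().lower()
    if key not in PIPELINE_BY_FAMILY:
        raise ValueError(f"unknown backbone family: {family!r}")
    return PIPELINE_BY_FAMILY[key]


chosen = pipeline_name_for("SDXL")
```

Consult RPG.py and the example notebooks for the actual imports and constructor arguments.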

Highlighted Details

Supports generation of high-resolution images (e.g., 2048x1024).
Compatible with various diffusion backbones and MLLM architectures.
Later enhancements add support for newer MLLMs (DeepSeek-R1, o3-mini, o1) and the IterComp diffusion backbone.
Offers ControlNet integration for Open Pose and Depth Map conditioning.

Maintenance & Community

The accompanying paper was published at ICML 2024. The README acknowledges AUTOMATIC1111, regional-prompter, SAM, and diffusers. It does not provide further community engagement details.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the underlying diffusion model licenses and any specific terms associated with the RPG framework itself.

Limitations & Caveats

The README notes that local LLMs increase load times and VRAM usage. Output quality depends on configuring the base_prompt and base_ratio parameters properly; the paper and examples provide guidance.
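To illustrate why base_ratio matters, the sketch below assumes it acts as a linear weight between a latent denoised with base_prompt and the combined regional latents. That weighting rule, the flat-list latent representation, and the function `blend_latents` are assumptions made for illustration; check the paper and RPG.py for the actual behavior.

```python
def blend_latents(base, regional, base_ratio):
    """Blend a base-prompt latent with regional latents (assumed linear rule).

    base, regional: equal-length lists of floats standing in for latents.
    base_ratio: weight of the base latent, expected in [0, 1].
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be between 0 and 1")
    if len(base) != len(regional):
        raise ValueError("latents must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r
        for b, r in zip(base, regional)
    ]


# Toy values: a higher base_ratio pulls the result toward the base latent.
mixed = blend_latents([1.0, 0.0], [0.0, 1.0], base_ratio=0.25)
```

Under this assumption, a high base_ratio favors global coherence from the base prompt, while a low one gives the regional sub-prompts more influence.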

Health Check

Last Commit: 11 months ago

Responsiveness: Inactive

Star History

3 stars in the last 30 days