RPG-DiffusionMaster  by YangLing0818

Training-free paradigm for text-to-image generation/editing

created 1 year ago
1,819 stars

Top 24.3% on sourcepulse

Project Summary

This repository provides the official implementation of RPG, a training-free paradigm for text-to-image generation and editing. It uses multimodal large language models (MLLMs) for prompt recaptioning and regional planning, then applies regional diffusion to render each planned region. The authors report state-of-the-art results, particularly on complex compositional prompts. The framework targets researchers and practitioners in AI image generation who need finer control and fidelity.

How It Works

RPG integrates MLLMs (like GPT-4, Gemini-Pro, or local models such as miniGPT-4) to break down complex text prompts into regional descriptions and spatial layouts. This structured input is then fed into a complementary regional diffusion model, allowing for precise control over different image areas. This approach enables the generation of images with high resolution and intricate details, overcoming limitations of standard text-to-image models in handling complex spatial relationships and multiple object attributes.
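To make the planning step concrete, the sketch below turns a row-and-column layout into pixel boxes, one per regional description. It is an illustration only: the `Region` type, the `layout_regions` helper, and the nested-list plan format are hypothetical and are not the repository's actual data structures or split-ratio syntax.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Region:
    prompt: str                     # sub-prompt the planner assigned to this region
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


# Hypothetical plan format: a list of rows, each row being
# (row_weight, [(column_weight, regional_prompt), ...]).
Row = Tuple[float, Sequence[Tuple[float, str]]]


def layout_regions(rows: Sequence[Row], width: int, height: int) -> List[Region]:
    """Convert relative row/column weights into non-overlapping pixel boxes."""
    regions: List[Region] = []
    total_row_weight = sum(weight for weight, _ in rows)
    top, seen_rows = 0, 0.0
    for row_weight, columns in rows:
        seen_rows += row_weight
        # Cumulative rounding keeps the last row flush with the image edge.
        bottom = round(height * seen_rows / total_row_weight)
        total_col_weight = sum(weight for weight, _ in columns)
        left, seen_cols = 0, 0.0
        for col_weight, prompt in columns:
            seen_cols += col_weight
            right = round(width * seen_cols / total_col_weight)
            regions.append(Region(prompt, (left, top, right, bottom)))
            left = right
        top = bottom
    return regions


# Example: two regions side by side on top, one wide region below.
plan = [
    (1, [(1, "a red fox"), (1, "a snowy pine tree")]),
    (2, [(1, "a frozen lake at dusk")]),
]
for region in layout_regions(plan, width=1024, height=768):
    print(region)
```

Each box would then be paired with its regional prompt when the diffusion stage renders that area of the image.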

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n RPG python==3.9), activate it (conda activate RPG), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Requires Python 3.9 or later (the install command pins 3.9), PyTorch, and Hugging Face's diffusers library. An NVIDIA GPU with at least 10GB VRAM is recommended for the diffusion pipeline. API-based MLLMs such as GPT-4 add no local VRAM cost, whereas local MLLMs require additional VRAM.
  • Models: Download a diffusion backbone (SDXL, SDXL-Turbo, Playground v2, CIVITAI community checkpoints such as AlbedoBase XL and DreamShaper XL, SD v1.5, or SD v2.1) and choose an MLLM: API-based (GPT-4, Gemini-Pro) or locally hosted (miniGPT-4, Llama2-13b-chat, Llama2-70b-chat).
  • Usage: Refer to RPG.py and example notebooks for detailed usage with GPT-4, local LLMs, and different diffusion pipelines (RegionalDiffusionPipeline for SD v1.x/v2.x, RegionalDiffusionXLPipeline for SDXL).
  • Links: Official Implementation, Hugging Face Spaces, Example Notebook
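Before running the pipelines, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch of such a check, not part of the repository: the `missing_prerequisites` helper is hypothetical, and it only tests whether the interpreter is new enough and whether the `torch` and `diffusers` packages are importable. It does not check GPU memory.

```python
import importlib.util
import sys
from typing import List, Sequence, Tuple


def missing_prerequisites(
    min_python: Tuple[int, int] = (3, 9),
    packages: Sequence[str] = ("torch", "diffusers"),
) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems: List[str] = []
    if sys.version_info < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    print("All prerequisites found." if not issues else "\n".join(issues))
```

Running the script inside the activated conda environment lists anything that still needs to be installed.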

Highlighted Details

  • Supports generation of high-resolution images (e.g., 2048x1024).
  • Compatible with various diffusion backbones and MLLM architectures.
  • Enhancements include integration with advanced MLLMs (DeepSeek-R1, o3-mini, o1) and diffusion backbones (IterComp).
  • Offers ControlNet integration for Open Pose and Depth Map conditioning.

Maintenance & Community

The project is associated with ICML 2024 and acknowledges contributions from AUTOMATIC1111, regional-prompter, SAM, and diffusers. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the underlying diffusion model licenses and any specific terms associated with the RPG framework itself.

Limitations & Caveats

The README suggests that using local LLMs can increase load times and VRAM usage. Achieving satisfactory results depends on proper configuration of base_prompt and base_ratio parameters, with guidance provided in the paper and examples.
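As intuition for why base_ratio matters, the sketch below blends a latent produced from the base prompt with a latent assembled from the regional prompts. This summary does not state the weighting the repository actually uses, so the convex combination here is an assumption, and the `blend_latents` helper and the flat-list latent representation are hypothetical simplifications.

```python
from typing import List, Sequence


def blend_latents(
    base: Sequence[float], regional: Sequence[float], base_ratio: float
) -> List[float]:
    """Mix base-prompt and regional latents element by element.

    Assumed weighting: base_ratio * base + (1 - base_ratio) * regional.
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must lie in [0, 1]")
    if len(base) != len(regional):
        raise ValueError("base and regional latents must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r for b, r in zip(base, regional)
    ]


# A higher base_ratio pulls the result toward the global composition;
# a lower one favours the independently generated regions.
mixed = blend_latents([1.0, 1.0], [0.0, 2.0], base_ratio=0.5)
print(mixed)
```

Under this assumption, a base_ratio near 1 lets the base prompt dominate and can wash out regional detail, while a value near 0 keeps the regions distinct at the risk of visible seams between them.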

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 90 days
