RPG-DiffusionMaster by YangLing0818

Training-free paradigm for text-to-image generation/editing

Created 1 year ago
1,824 stars

Top 23.7% on SourcePulse

Project Summary

This repository provides the official implementation for RPG, a training-free paradigm for advanced text-to-image generation and editing. It leverages multimodal large language models (MLLMs) for prompt recaptioning and regional planning, combined with regional diffusion techniques, to achieve state-of-the-art results, particularly for complex compositional prompts. The framework is designed for researchers and practitioners in AI image generation seeking enhanced control and fidelity.

How It Works

RPG integrates MLLMs (like GPT-4, Gemini-Pro, or local models such as miniGPT-4) to break down complex text prompts into regional descriptions and spatial layouts. This structured input is then fed into a complementary regional diffusion model, allowing for precise control over different image areas. This approach enables the generation of images with high resolution and intricate details, overcoming limitations of standard text-to-image models in handling complex spatial relationships and multiple object attributes.
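
As a rough illustration of the planning step, the sketch below turns a spatial layout into pixel boxes, one per regional description. The layout format (a list of rows, each with a relative height and a list of relative column widths) and the names `Region` and `layout_to_boxes` are hypothetical, chosen for this example only; they are not RPG's actual split-ratio format or API.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Region:
    prompt: str                      # regional description (e.g. produced by an MLLM)
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def layout_to_boxes(
    rows: Sequence[Tuple[float, Sequence[float]]],
    prompts: Sequence[str],
    width: int,
    height: int,
) -> List[Region]:
    """Map a hypothetical row/column layout to pixel boxes.

    rows: (row_weight, [column_weights]) pairs, top to bottom.
    prompts: one regional description per cell, in row-major order.
    """
    n_cells = sum(len(cols) for _, cols in rows)
    if n_cells != len(prompts):
        raise ValueError(f"layout has {n_cells} cells but {len(prompts)} prompts")

    total_h = sum(weight for weight, _ in rows)
    prompt_iter = iter(prompts)
    regions: List[Region] = []
    acc_h = 0.0
    for row_weight, cols in rows:
        # Vertical extent of this row, proportional to its weight.
        top = round(height * acc_h / total_h)
        acc_h += row_weight
        bottom = round(height * acc_h / total_h)

        total_w = sum(cols)
        acc_w = 0.0
        for col_weight in cols:
            # Horizontal extent of this cell within the row.
            left = round(width * acc_w / total_w)
            acc_w += col_weight
            right = round(width * acc_w / total_w)
            regions.append(Region(next(prompt_iter), (left, top, right, bottom)))
    return regions
```

For example, `layout_to_boxes([(1, [1, 1])], ["a man in a black suit", "a girl in a red cheongsam"], 1024, 1024)` assigns the left and right halves of the canvas to the two descriptions. A regional diffusion step would then condition each box on its own prompt.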

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n RPG python==3.9), activate it (conda activate RPG), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Requires Python 3.9+, PyTorch, and Hugging Face's diffusers library. An NVIDIA GPU with at least 10GB VRAM is recommended. API-hosted MLLMs such as GPT-4 run remotely and add no local VRAM load, whereas local MLLMs need additional VRAM.
  • Models: Download a diffusion backbone (SDXL, SDXL-Turbo, Playground v2, CivitAI-hosted checkpoints such as AlbedoBase XL and DreamShaper XL, SD v1.5, or SD v2.1). For the MLLM, use an API-hosted model (GPT-4, Gemini-Pro) or download a local one (miniGPT-4, Llama2-13b-chat, Llama2-70b-chat).
  • Usage: Refer to RPG.py and example notebooks for detailed usage with GPT-4, local LLMs, and different diffusion pipelines (RegionalDiffusionPipeline for SD v1.x/v2.x, RegionalDiffusionXLPipeline for SDXL).
  • Links: Official Implementation, Hugging Face Spaces, Example Notebook
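
The pipeline choice described under Usage can be captured in a small lookup. This is only a sketch: `pipeline_name_for` is a hypothetical helper, it returns class names as strings instead of importing anything from the repository, and it covers only the backbones for which the summary states a pipeline explicitly.

```python
# Backbone families as named in the summary; other checkpoints are
# deliberately left out because the summary does not say which
# pipeline they use.
SD_V1_V2_BACKBONES = {"sd v1.5", "sd v2.1"}
SDXL_BACKBONES = {"sdxl"}


def pipeline_name_for(backbone: str) -> str:
    """Return the name of the pipeline class suited to a backbone (hypothetical helper)."""
    key = backbone.strip().lower()
    if key in SD_V1_V2_BACKBONES:
        return "RegionalDiffusionPipeline"
    if key in SDXL_BACKBONES:
        return "RegionalDiffusionXLPipeline"
    raise ValueError(f"no pipeline stated for backbone {backbone!r}; check RPG.py")
```

For checkpoints outside these two families, RPG.py and the example notebooks remain the reference.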

Highlighted Details

  • Supports generation of high-resolution images (e.g., 2048x1024).
  • Compatible with various diffusion backbones and MLLM architectures.
  • Enhancements include integration with advanced MLLMs (DeepSeek-R1, o3-mini, o1) and diffusion backbones (IterComp).
  • Offers ControlNet integration for Open Pose and Depth Map conditioning.

Maintenance & Community

The project is associated with ICML 2024 and acknowledges contributions from AUTOMATIC1111, regional-prompter, SAM, and diffusers. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the underlying diffusion model licenses and any specific terms associated with the RPG framework itself.

Limitations & Caveats

The README suggests that using local LLMs can increase load times and VRAM usage. Achieving satisfactory results depends on proper configuration of base_prompt and base_ratio parameters, with guidance provided in the paper and examples.
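
One plausible reading of base_ratio, assumed here and not confirmed by this summary, is a blend weight between an image-wide result driven by base_prompt and the stitched regional results. The sketch below shows that weighting on plain lists of floats instead of real latents; `blend_base_and_regional` is a hypothetical name, not part of RPG.

```python
from typing import List, Sequence


def blend_base_and_regional(
    base: Sequence[float],
    regional: Sequence[float],
    base_ratio: float,
) -> List[float]:
    """Weighted mix: base_ratio of the base signal, the rest from the regional signal.

    Assumption for illustration: base_ratio lies in [0, 1], with 0 meaning
    "regions only" and 1 meaning "base prompt only".
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be between 0 and 1")
    if len(base) != len(regional):
        raise ValueError("base and regional must have the same length")
    return [base_ratio * b + (1.0 - base_ratio) * r for b, r in zip(base, regional)]
```

Under this assumption, a higher base_ratio favors global coherence across region boundaries, while a lower value gives each regional description more influence over its own area.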

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
