Training-free paradigm for text-to-image generation/editing
This repository provides the official implementation for RPG, a training-free paradigm for advanced text-to-image generation and editing. It leverages multimodal large language models (MLLMs) for prompt recaptioning and regional planning, combined with regional diffusion techniques, to achieve state-of-the-art results, particularly for complex compositional prompts. The framework is designed for researchers and practitioners in AI image generation seeking enhanced control and fidelity.
How It Works
RPG integrates MLLMs (like GPT-4, Gemini-Pro, or local models such as miniGPT-4) to break down complex text prompts into regional descriptions and spatial layouts. This structured input is then fed into a complementary regional diffusion model, allowing for precise control over different image areas. This approach enables the generation of images with high resolution and intricate details, overcoming limitations of standard text-to-image models in handling complex spatial relationships and multiple object attributes.
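To make the idea of "regional descriptions and spatial layouts" concrete, here is a minimal sketch of what such a plan could look like as data. It is not RPG's own code or plan format: the Region class, the split_columns and region_mask helpers, and the side-by-side column layout are all hypothetical, chosen only to show how sub-prompts can be tied to areas of a latent grid.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One planned area of the image (hypothetical structure, not RPG's format)."""
    prompt: str   # sub-prompt describing this area
    box: tuple    # (x0, y0, x1, y1) in normalized [0, 1] coordinates


def split_columns(prompts, ratios):
    """Lay regions out side by side, with widths proportional to `ratios`."""
    if len(prompts) != len(ratios):
        raise ValueError("need exactly one ratio per regional prompt")
    total = sum(ratios)
    regions, x = [], 0.0
    for prompt, ratio in zip(prompts, ratios):
        width = ratio / total
        regions.append(Region(prompt, (x, 0.0, x + width, 1.0)))
        x += width
    return regions


def region_mask(region, width, height):
    """Binary mask marking the grid cells whose centre falls inside the box."""
    x0, y0, x1, y1 = region.box
    return [
        [
            1 if x0 <= (col + 0.5) / width < x1 and y0 <= (row + 0.5) / height < y1 else 0
            for col in range(width)
        ]
        for row in range(height)
    ]


# Assumed example plan: three sub-prompts with a 1:2:1 width split.
regions = split_columns(["a red fox", "a snowy pine", "a wooden cabin"], [1, 2, 1])
masks = [region_mask(r, width=8, height=4) for r in regions]
```

In this sketch every grid cell belongs to exactly one region, so each sub-prompt can steer its own part of the image without overlapping the others.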
Quick Start & Requirements
Create a conda environment (conda create -n RPG python==3.9), activate it (conda activate RPG), and install dependencies (pip install -r requirements.txt).
The code relies on the diffusers library. For optimal performance, NVIDIA GPUs with at least 10GB VRAM are recommended, especially when using powerful MLLMs like GPT-4. Local MLLMs may require more VRAM.
See RPG.py and the example notebooks for detailed usage with GPT-4, local LLMs, and different diffusion pipelines (RegionalDiffusionPipeline for SD v1.x/v2.x, RegionalDiffusionXLPipeline for SDXL).
Highlighted Details
Maintenance & Community
The project is associated with ICML 2024 and acknowledges contributions from AUTOMATIC1111, regional-prompter, SAM, and diffusers. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the underlying diffusion model licenses and any specific terms associated with the RPG framework itself.
Limitations & Caveats
The README suggests that using local LLMs can increase load times and VRAM usage. Achieving satisfactory results depends on proper configuration of the base_prompt and base_ratio parameters, with guidance provided in the paper and examples.
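The summary above does not say how base_ratio is applied. One plausible reading is that it weights a latent generated from base_prompt against the latent assembled from the regional prompts. The sketch below illustrates that reading only: blend_latents is a hypothetical helper, the linear blend is an assumption rather than RPG's confirmed behaviour, and plain nested lists stand in for real latent tensors.

```python
def blend_latents(base_latent, regional_latent, base_ratio):
    """Weighted sum of two equally shaped 2D latents (assumed linear blend).

    base_ratio = 1.0 keeps only the base-prompt latent;
    base_ratio = 0.0 keeps only the regional latent.
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must lie in [0, 1]")
    return [
        [base_ratio * b + (1.0 - base_ratio) * r for b, r in zip(base_row, regional_row)]
        for base_row, regional_row in zip(base_latent, regional_latent)
    ]


# Toy values, chosen only to make the arithmetic easy to follow.
base = [[1.0, 1.0]]
regional = [[0.0, 2.0]]
mixed = blend_latents(base, regional, base_ratio=0.5)
```

Under this assumption, a higher base_ratio would pull the image toward the overall prompt and a lower one would give the regional prompts more influence, which is one way to see why the setting affects result quality.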