Reasoning-driven visual generation and editing framework
GoT introduces a novel paradigm for visual generation and editing by integrating explicit language reasoning with visual output. Targeting researchers and developers in multimodal AI, it enables more human-aligned image creation and modification through a unified, reasoning-guided framework.
How It Works
GoT employs a two-component architecture: a Semantic-Spatial Multimodal Large Language Model (MLLM) and a Semantic-Spatial Guidance Diffusion Module (SSGM). The MLLM, based on Qwen2.5-VL, generates detailed reasoning chains incorporating spatial information. The SSGM then utilizes this semantic and spatial guidance, along with reference images for editing tasks, to produce high-quality visual outputs via a diffusion process. This approach allows for precise control over object placement and relationships, enhancing compositional accuracy.
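As a rough illustration of that flow (not the repository's actual API), the sketch below splits generation into a planning step that stands in for the MLLM's reasoning chain and a guidance step that stands in for the SSGM's conditioning; the names ReasoningStep, plan_with_mllm, and condition_ssgm, along with the stubbed outputs, are assumptions made for this example.

```python
# Illustrative sketch of a reasoning-then-guidance pipeline; names and outputs are
# made up for this example and do not correspond to the GoT codebase.
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    """One step of the reasoning chain: a phrase plus a normalized box (x1, y1, x2, y2)."""
    phrase: str
    box: tuple[float, float, float, float]


def plan_with_mllm(prompt: str) -> list[ReasoningStep]:
    """Stage 1 stand-in: a real MLLM would emit a reasoning chain with grounded layout.
    A fixed plan is returned here so the sketch runs without any model weights."""
    return [
        ReasoningStep("a red apple", (0.10, 0.55, 0.40, 0.90)),
        ReasoningStep("a glass of water", (0.55, 0.35, 0.85, 0.90)),
    ]


def condition_ssgm(prompt: str, plan: list[ReasoningStep]) -> dict:
    """Stage 2 stand-in: a real diffusion module would denoise an image conditioned on
    both the semantic chain and the spatial layout; here we only assemble the inputs."""
    return {
        "semantic_guidance": prompt + " | " + "; ".join(step.phrase for step in plan),
        "spatial_guidance": [step.box for step in plan],
    }


if __name__ == "__main__":
    prompt = "an apple next to a glass of water on a wooden table"
    plan = plan_with_mllm(prompt)
    conditioning = condition_ssgm(prompt, plan)
    print(conditioning["semantic_guidance"])
    print(conditioning["spatial_guidance"])
```

For an editing task, the conditioning would additionally carry the reference image alongside the semantic and spatial guidance, as described above.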
Quick Start & Requirements
pip install -r requirements.txt
Pretrained model weights go in the ../pretrained directory.
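The quick start only covers the install command and the checkpoint location; a small pre-flight check like the one below, which simply verifies that files exist under ../pretrained before launching the repository's scripts, can catch a missing download early. The filenames used here are placeholders, not the names of the actual release checkpoints.

```python
# Pre-flight check sketch: the checkpoint filenames below are placeholders, not the
# actual release artifacts; adjust them to match whatever you download.
from pathlib import Path

PRETRAINED_DIR = Path("../pretrained")


def check_weights(required: list[str]) -> None:
    """Raise early if expected checkpoint files are missing from ../pretrained."""
    missing = [name for name in required if not (PRETRAINED_DIR / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"Missing files under {PRETRAINED_DIR.resolve()}: {missing}"
        )


if __name__ == "__main__":
    check_weights(["got_mllm.safetensors", "got_diffusion.safetensors"])
    print("Checkpoints found; proceed with the repository's generation/editing scripts.")
```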
Highlighted Details
Maintenance & Community
The project is associated with multiple research institutions including CUHK MMLab and HKU MMLab, with corresponding authors available via email. Issues can be raised on the GitHub repository.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The project is presented as a research artifact with a 2025 preprint citation. The README does not go into specific performance numbers or the caveats that would matter for real-world deployment.