GoT by rongyaofang

Reasoning-driven visual generation and editing framework

Created 6 months ago
285 stars

Top 91.9% on SourcePulse

Project Summary

GoT introduces a novel paradigm for visual generation and editing by integrating explicit language reasoning with visual output. Targeting researchers and developers in multimodal AI, it enables more human-aligned image creation and modification through a unified, reasoning-guided framework.

How It Works

GoT employs a two-component architecture: a Semantic-Spatial Multimodal Large Language Model (MLLM) and a Semantic-Spatial Guidance Diffusion Module (SSGM). The MLLM, based on Qwen2.5-VL, generates detailed reasoning chains incorporating spatial information. The SSGM then utilizes this semantic and spatial guidance, along with reference images for editing tasks, to produce high-quality visual outputs via a diffusion process. This approach allows for precise control over object placement and relationships, enhancing compositional accuracy.
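
To make the flow concrete, below is a minimal structural sketch of the two-stage pipeline. All names here (GroundedStep, generate_reasoning_chain, run_ssgm) are illustrative stubs, not the repository's actual API; they only show how semantic and spatial guidance pass from the MLLM stage to the diffusion stage.

```python
# Structural sketch of the GoT two-stage pipeline (illustrative, not the repo API).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GroundedStep:
    """One step of the reasoning chain: a description plus the
    image region (x1, y1, x2, y2) it refers to."""
    description: str
    box: Tuple[int, int, int, int]

def generate_reasoning_chain(prompt: str) -> List[GroundedStep]:
    """Stage 1 (Semantic-Spatial MLLM, Qwen2.5-VL-based): turn the prompt
    into an ordered chain of grounded steps. Stubbed with fixed output."""
    return [
        GroundedStep("a red apple on the left side of a table", (20, 300, 280, 560)),
        GroundedStep("a glass of water to the right of the apple", (340, 220, 520, 560)),
    ]

def run_ssgm(steps: List[GroundedStep], reference=None) -> None:
    """Stage 2 (Semantic-Spatial Guidance Diffusion Module): condition the
    diffusion process on semantic text and spatial boxes; a reference image
    is supplied for editing tasks. The denoising loop is omitted here."""
    semantic_guidance = " ; ".join(s.description for s in steps)
    spatial_guidance = [s.box for s in steps]
    print("semantic:", semantic_guidance)
    print("spatial:", spatial_guidance)

if __name__ == "__main__":
    chain = generate_reasoning_chain("an apple next to a glass of water on a table")
    run_ssgm(chain)
```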

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Download GoT-6B, Qwen2.5-VL-3B-Instruct, and Stable Diffusion XL Base 1.0 models and place them in the ./pretrained directory.
  • Resources: Requires significant GPU memory for model weights and inference.
  • Documentation: Inference instructions are provided in a notebook in the repository; a minimal sketch of loading the weights listed above follows this list.
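
The sketch below loads the two component backbones from ./pretrained using the Hugging Face transformers (>= 4.49, for Qwen2.5-VL support) and diffusers APIs. The subdirectory names are assumptions, and how GoT actually wires these components together (including the GoT-6B checkpoint) is defined in the repository's inference notebook; this only illustrates the prerequisites.

```python
# Sketch of loading the listed component weights (directory names assumed).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from diffusers import StableDiffusionXLPipeline

device = "cuda"  # an NVIDIA GPU with CUDA is required

# Reasoning MLLM backbone: Qwen2.5-VL-3B-Instruct
mllm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./pretrained/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
).to(device)
processor = AutoProcessor.from_pretrained("./pretrained/Qwen2.5-VL-3B-Instruct")

# Diffusion backbone: Stable Diffusion XL Base 1.0
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "./pretrained/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)
```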

Highlighted Details

  • Achieves state-of-the-art performance on the GenEval benchmark for text-to-image generation, particularly in complex composition tasks.
  • Demonstrates superior results on image editing benchmarks like Emu-Edit and ImagenHub.
  • Released datasets (Laion-Aesthetics-High-Resolution-GoT, JourneyDB-GoT, OmniEdit-GoT) include detailed reasoning chains and spatial annotations; an illustrative record shape follows this list.
  • The GoT-6B model combines Qwen2.5-VL-3B with SDXL.
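
For a sense of what such annotated data looks like, here is an illustrative record: a prompt paired with a natural-language reasoning chain and per-object spatial annotations. The field names and the normalized-coordinate convention are assumptions for illustration, not the released schema.

```python
# Illustrative GoT-style dataset record (field names and coordinates assumed).
record = {
    "prompt": "a cat sleeping on a windowsill beside a potted plant",
    "reasoning_chain": (
        "The scene centers on a cat lying on a windowsill. "
        "The cat occupies the left region; a potted plant stands to its right."
    ),
    "spatial_annotations": [
        {"object": "cat", "box": [0.05, 0.45, 0.55, 0.85]},          # normalized x1, y1, x2, y2
        {"object": "potted plant", "box": [0.60, 0.30, 0.95, 0.85]},
    ],
}
```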

Maintenance & Community

The project is associated with multiple research institutions including CUHK MMLab and HKU MMLab, with corresponding authors available via email. Issues can be raised on the GitHub repository.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is presented as a research artifact with a 2025 preprint citation. The README does not extensively document performance characteristics or limitations relevant to real-world deployment.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days
