GoT by rongyaofang

Reasoning-driven visual generation and editing framework

created 4 months ago · 271 stars · Top 95.8% on sourcepulse

Project Summary

GoT introduces a novel paradigm for visual generation and editing by integrating explicit language reasoning with visual output. Targeting researchers and developers in multimodal AI, it enables more human-aligned image creation and modification through a unified, reasoning-guided framework.

How It Works

GoT employs a two-component architecture: a Semantic-Spatial Multimodal Large Language Model (MLLM) and a Semantic-Spatial Guidance Diffusion Module (SSGM). The MLLM, based on Qwen2.5-VL, generates detailed reasoning chains incorporating spatial information. The SSGM then utilizes this semantic and spatial guidance, along with reference images for editing tasks, to produce high-quality visual outputs via a diffusion process. This approach allows for precise control over object placement and relationships, enhancing compositional accuracy.
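As a rough illustration of this flow, the sketch below wires the two stages together with stock transformers and diffusers calls, assuming the pretrained weights described under Quick Start. It is not the repository's actual API: GoT's SSGM injects the parsed spatial layout into the denoising process itself, which plain SDXL conditioning cannot do, so that step is only indicated by a comment.

    # Illustrative sketch of the two-stage flow above -- NOT the repository's
    # actual API. Only the transformers/diffusers calls are real; the spatial
    # guidance that SSGM injects into denoising is reduced to a comment here.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from diffusers import StableDiffusionXLPipeline

    # Stage 1: the Semantic-Spatial MLLM turns the prompt into a reasoning
    # chain that interleaves text with spatial annotations (e.g. boxes).
    mllm_dir = "./pretrained/Qwen2.5-VL-3B-Instruct"
    processor = AutoProcessor.from_pretrained(mllm_dir)
    mllm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        mllm_dir, torch_dtype=torch.bfloat16
    ).to("cuda")

    prompt = "A red mug to the left of a silver laptop on a wooden desk"
    inputs = processor(text=prompt, return_tensors="pt").to("cuda")
    out = mllm.generate(**inputs, max_new_tokens=256)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]  # drop the echoed prompt
    chain = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    # chain might read: "... mug at (0.10, 0.45, 0.35, 0.80); laptop at ..."

    # Stage 2: a diffusion model renders the output. Plain SDXL below only
    # consumes the text; GoT's SSGM additionally feeds the parsed spatial
    # layout into the denoising process for compositional control.
    sdxl = StableDiffusionXLPipeline.from_pretrained(
        "./pretrained/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    sdxl(prompt=chain).images[0].save("output.png")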

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Download the GoT-6B, Qwen2.5-VL-3B-Instruct, and Stable Diffusion XL Base 1.0 weights and place them in the ./pretrained directory (a download sketch follows this list).
  • Resources: Requires significant GPU memory for model weights and inference.
  • Documentation: Inference instructions are available in a notebook.
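Since all three checkpoints must end up under ./pretrained, the hedged sketch below fetches them with huggingface_hub. The Qwen and SDXL repo ids are their public Hugging Face ones; the GoT-6B id is a placeholder and should be taken from the repository README.

    # Download sketch using huggingface_hub's snapshot_download. The Qwen and
    # SDXL repo ids are the public Hugging Face ones; the GoT-6B id is a
    # placeholder -- take the real location from the repository README.
    from huggingface_hub import snapshot_download

    weights = {
        "Qwen/Qwen2.5-VL-3B-Instruct": "./pretrained/Qwen2.5-VL-3B-Instruct",
        "stabilityai/stable-diffusion-xl-base-1.0": "./pretrained/stable-diffusion-xl-base-1.0",
        "<GoT-6B-repo-id>": "./pretrained/GoT-6B",  # placeholder, see README
    }
    for repo_id, local_dir in weights.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)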

Highlighted Details

  • Achieves state-of-the-art performance on the GenEval benchmark for text-to-image generation, particularly in complex composition tasks.
  • Demonstrates superior results on image editing benchmarks like Emu-Edit and ImagenHub.
  • Released datasets (Laion-Aesthetics-High-Resolution-GoT, JourneyDB-GoT, OmniEdit-GoT) include detailed reasoning chains and spatial annotations.
  • The GoT-6B model combines Qwen2.5-VL-3B with SDXL.

Maintenance & Community

The project is associated with multiple research institutions including CUHK MMLab and HKU MMLab, with corresponding authors available via email. Issues can be raised on the GitHub repository.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is presented as a research artifact with a 2025 preprint citation. The README does not extensively document performance characteristics or limitations relevant to real-world deployment.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 38 stars in the last 90 days
