Reasoning-driven visual generation and editing framework
GoT introduces a novel paradigm for visual generation and editing by integrating explicit language reasoning with visual output. Targeting researchers and developers in multimodal AI, it enables more human-aligned image creation and modification through a unified, reasoning-guided framework.
How It Works
GoT employs a two-component architecture: a Semantic-Spatial Multimodal Large Language Model (MLLM) and a Semantic-Spatial Guidance Diffusion Module (SSGM). The MLLM, based on Qwen2.5-VL, generates detailed reasoning chains incorporating spatial information. The SSGM then utilizes this semantic and spatial guidance, along with reference images for editing tasks, to produce high-quality visual outputs via a diffusion process. This approach allows for precise control over object placement and relationships, enhancing compositional accuracy.
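As a rough illustration of that flow (not the repository's actual API), the sketch below splits generation into a planning step that stands in for the MLLM's reasoning chain and a guidance step that stands in for the SSGM's conditioning; the names ReasoningStep, plan_with_mllm, and condition_ssgm, along with the stubbed outputs, are assumptions made for this example.

```python
# Illustrative sketch of a reasoning-then-guidance pipeline; names and outputs are
# made up for this example and do not correspond to the GoT codebase.
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    """One step of the reasoning chain: a phrase plus a normalized box (x1, y1, x2, y2)."""
    phrase: str
    box: tuple[float, float, float, float]


def plan_with_mllm(prompt: str) -> list[ReasoningStep]:
    """Stage 1 stand-in: a real MLLM would emit a reasoning chain with grounded layout.
    A fixed plan is returned here so the sketch runs without any model weights."""
    return [
        ReasoningStep("a red apple", (0.10, 0.55, 0.40, 0.90)),
        ReasoningStep("a glass of water", (0.55, 0.35, 0.85, 0.90)),
    ]


def condition_ssgm(prompt: str, plan: list[ReasoningStep]) -> dict:
    """Stage 2 stand-in: a real diffusion module would denoise an image conditioned on
    both the semantic chain and the spatial layout; here we only assemble the inputs."""
    return {
        "semantic_guidance": prompt + " | " + "; ".join(step.phrase for step in plan),
        "spatial_guidance": [step.box for step in plan],
    }


if __name__ == "__main__":
    prompt = "an apple next to a glass of water on a wooden table"
    plan = plan_with_mllm(prompt)
    conditioning = condition_ssgm(prompt, plan)
    print(conditioning["semantic_guidance"])
    print(conditioning["spatial_guidance"])
```

For an editing task, the conditioning would additionally carry the reference image alongside the semantic and spatial guidance, as described above.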
Quick Start & Requirements
pip install -r requirements.txt
Pretrained model weights go in the ../pretrained directory.
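The quick start only covers the install command and the checkpoint location; a small pre-flight check like the one below, which simply verifies that files exist under ../pretrained before launching the repository's scripts, can catch a missing download early. The filenames used here are placeholders, not the names of the actual release checkpoints.

```python
# Pre-flight check sketch: the checkpoint filenames below are placeholders, not the
# actual release artifacts; adjust them to match whatever you download.
from pathlib import Path

PRETRAINED_DIR = Path("../pretrained")


def check_weights(required: list[str]) -> None:
    """Raise early if expected checkpoint files are missing from ../pretrained."""
    missing = [name for name in required if not (PRETRAINED_DIR / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"Missing files under {PRETRAINED_DIR.resolve()}: {missing}"
        )


if __name__ == "__main__":
    check_weights(["got_mllm.safetensors", "got_diffusion.safetensors"])
    print("Checkpoints found; proceed with the repository's generation/editing scripts.")
```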
Highlighted Details
Maintenance & Community
The project is associated with multiple research institutions including CUHK MMLab and HKU MMLab, with corresponding authors available via email. Issues can be raised on the GitHub repository.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The project is presented as a research artifact with a 2025 preprint citation. The README does not go into specific performance numbers or the caveats that would matter for real-world deployment.