Image-Generation-CoT by ZiyuGuo99

Research paper exploring CoT reasoning for image generation

Created 1 year ago

856 stars

Top 41.6% on SourcePulse

1 Expert Loves This Project

Alex Yu

Research Scientist at OpenAI; Cofounder of Luma AI

Project Summary

This repository accompanies what its authors describe as the first comprehensive investigation into applying Chain-of-Thought (CoT) reasoning to autoregressive image generation. It is aimed at researchers and practitioners in computer vision and generative AI, and offers methods that improve image quality and coherence by verifying and reinforcing generation step by step.

How It Works

The project adapts three CoT-style techniques to image generation: 1) test-time computation scaling, in which a reward model acts as a verifier over candidate outputs; 2) preference alignment via Direct Preference Optimization (DPO), which trains the generator toward preferred outputs; and 3) a combination of the two for synergistic improvements. The verifiers studied are the Outcome Reward Model (ORM) and the two reward models proposed here, the Potential Assessment Reward Model (PARM) and PARM++. PARM and PARM++ are specialized for autoregressive generation: PARM adaptively assesses each generation step, and PARM++ adds a reflection mechanism for self-correcting unsatisfactory images.
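The difference between scoring only finished images and checking intermediate steps can be illustrated with a toy sketch. This is not the repository's implementation: `Path`, `extend`, `has_potential`, and `outcome_reward` are hypothetical stand-ins for a partially decoded image, one decoding step, a step-level check, and a final-image score.

```python
from typing import Callable, List, Optional

# A "path" is a toy stand-in for a partially decoded image (a token list).
Path = List[int]


def best_of_n(finals: List[Path], outcome_reward: Callable[[Path], float]) -> Path:
    """Outcome-style selection: score finished candidates only, keep the top one."""
    return max(finals, key=outcome_reward)


def stepwise_best_of_n(
    n_candidates: int,
    n_steps: int,
    extend: Callable[[Path], Path],
    has_potential: Callable[[Path], bool],
    outcome_reward: Callable[[Path], float],
) -> Optional[Path]:
    """Step-wise variant: prune weak candidates after every step, then pick the best.

    Returns None if no candidate survives to the end.
    """
    paths: List[Path] = [[] for _ in range(n_candidates)]
    for _ in range(n_steps):
        paths = [extend(p) for p in paths]              # one more decoding step
        paths = [p for p in paths if has_potential(p)]  # hypothetical step check
        if not paths:
            break                                       # everything was pruned
    return best_of_n(paths, outcome_reward) if paths else None
```

In this sketch the step-level check saves computation by discarding candidates early, while the final choice still relies on an outcome score. A real verifier would replace both callables with learned reward models.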

Quick Start & Requirements

Installation: Clone the repository, create a conda environment (python=3.10), install PyTorch/TorchVision, and then install project dependencies (requirements.txt). It also requires installing mmdetection (2.x branch) and LLaVA-NeXT with its training dependencies.
Prerequisites: PyTorch, TorchVision, mmdetection, LLaVA-NeXT. Checkpoints for the reward models and DPO, the Mask2Former object detector, and the training data must be downloaded separately.
Resources: The training commands use torchrun with nproc_per_node=8, which suggests that training and evaluation expect a multi-GPU machine (eight GPUs as configured).
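For readers with fewer than eight GPUs, the launch command has to be adjusted. The helper below is a small sketch of assembling a `torchrun` invocation with a configurable process count. The script name `train.py` is a placeholder, not a file confirmed to exist in the repository, and `build_torchrun_command` is a hypothetical helper.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    script: str,
    nproc_per_node: int = 8,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a torchrun command line that starts one process per GPU."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return ["torchrun", f"--nproc_per_node={nproc_per_node}", script, *script_args]


# "train.py" is a placeholder script name used only for illustration.
cmd = build_torchrun_command("train.py", nproc_per_node=2)
print(shlex.join(cmd))
```

Lowering the process count generally shrinks the effective global batch size unless the per-device batch size or gradient accumulation is raised to compensate, so results may differ from those obtained with the eight-process configuration.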

Highlighted Details

Investigates ORM, PARM, PARM++, and DPO for autoregressive image generation.
Demonstrates significant improvements in image generation performance through CoT strategies.
Proposes PARM and PARM++ reward models specifically designed for autoregressive image generation.
Offers code for training and evaluation of baseline, ORM, PARM, and DPO-based models.
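Of the techniques listed above, DPO has a compact closed-form objective, so a per-pair sketch may help. The function below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is an illustration only: the function name, the argument names, and the default `beta=0.1` are assumptions rather than the repository's training code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of generations.

    Each argument is the summed log-probability of a full sequence under
    either the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen sample's likelihood relative to the rejected one.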

Maintenance & Community

The project accompanies the CVPR 2025 paper "Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step." A March 2025 update released the training code and data for DPO and for the fine-tuned ORM and PARM.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The code dependencies (mmdetection, LLaVA-NeXT) have their own licenses, which may impact commercial use or closed-source integration.

Limitations & Caveats

This is research code accompanying a CVPR 2025 paper. Although code and checkpoints are released, it may still be at an experimental stage. Hardware requirements beyond the multi-GPU training setup are not detailed.

Health Check

Last Commit

9 months ago

Responsiveness

1 week

Star History

3 stars in the last 30 days