Image-Generation-CoT by ZiyuGuo99

Research paper exploring CoT reasoning for image generation

created 6 months ago
775 stars

Top 46.0% on sourcepulse

Project Summary

This repository accompanies what its authors describe as the first comprehensive investigation of Chain-of-Thought (CoT) reasoning techniques for autoregressive image generation. It targets researchers and practitioners in computer vision and generative AI, and offers methods that improve image quality and coherence through step-by-step verification and reinforcement.

How It Works

The project explores three CoT reasoning strategies adapted to autoregressive image generation: 1) scaling test-time computation with reward models (ORM, an outcome reward model, plus the proposed PARM and PARM++) that verify generated outputs; 2) preference alignment via Direct Preference Optimization (DPO), which aligns model outputs with preferred samples; and 3) combining the two for complementary gains. PARM and PARM++ are reward models designed specifically for autoregressive generation: they assess the generation process step by step and enable self-correction.
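The verification idea can be illustrated with a toy sketch. It is not the repository's code: `best_of_n`, `stepwise_select`, the token lists, and the reward functions are all hypothetical stand-ins, and the step-wise variant is only a loose analogy to step-level assessment, not the PARM algorithm itself. The first function picks the candidate an outcome-level scorer rates highest; the second additionally drops candidates whose partial sequences score below a threshold before the final selection.

```python
from typing import Callable, List, Sequence

Tokens = List[int]  # hypothetical stand-in for a sequence of image tokens


def best_of_n(candidates: Sequence[Tokens],
              outcome_reward: Callable[[Tokens], float]) -> Tokens:
    """Outcome-level verification: keep the candidate with the highest final score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=outcome_reward)


def stepwise_select(candidates: Sequence[Tokens],
                    step_reward: Callable[[Tokens], float],
                    outcome_reward: Callable[[Tokens], float],
                    num_steps: int,
                    threshold: float) -> Tokens:
    """Step-level verification (illustrative): prune weak prefixes, then pick the best survivor."""
    survivors = list(candidates)
    for step in range(1, num_steps + 1):
        kept = [c for c in survivors if step_reward(c[:step]) >= threshold]
        if kept:  # never prune every path; fall back to the previous survivors
            survivors = kept
    return best_of_n(survivors, outcome_reward)


# Toy data: three "generations" and toy scorers (assumptions, not real reward models).
toy_candidates = [[1, 2, 3], [5, 0, 0], [2, 2, 4]]
toy_outcome = lambda tokens: float(sum(tokens))
toy_step = lambda prefix: sum(prefix) / len(prefix)

picked_by_outcome = best_of_n(toy_candidates, toy_outcome)
picked_stepwise = stepwise_select(toy_candidates, toy_step, toy_outcome,
                                  num_steps=1, threshold=3.0)
```

With these toy scorers the two strategies disagree: outcome-only selection favours the candidate with the largest total, while step-wise pruning first discards paths that start weakly. In practice the scorers would be learned reward models operating on partially or fully decoded images.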

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (python=3.10), install PyTorch and TorchVision, then install the project dependencies from requirements.txt. mmdetection (2.x branch) and LLaVA-NeXT, including its training dependencies, must also be installed.
  • Prerequisites: PyTorch, TorchVision, mmdetection, and LLaVA-NeXT. Checkpoints for the reward models and DPO, the Mask2Former object detector, and the training data must be downloaded separately.
  • Resources: The training commands use torchrun with nproc_per_node=8, which suggests that an 8-GPU setup is the expected configuration for training and evaluation.

Highlighted Details

  • Investigates ORM, PARM, PARM++, and DPO for autoregressive image generation.
  • Reports significant improvements in image generation performance from these CoT strategies.
  • Proposes PARM and PARM++ reward models specifically designed for autoregressive image generation.
  • Offers code for training and evaluation of baseline, ORM, PARM, and DPO-based models.
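For readers unfamiliar with DPO, the standard per-pair objective can be sketched in a few lines. This is the generic DPO loss, not the repository's training code: the function name `dpo_loss`, its arguments, and the default `beta` are assumptions for illustration. For an autoregressive image generator, each log-probability would typically be the summed token log-probabilities of a preferred or rejected image under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen sample than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities, chosen only to exercise the function.
loss_neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)    # policy identical to reference
loss_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)    # policy favours the chosen sample
loss_misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0) # policy favours the rejected sample
```

The loss falls as the policy raises the likelihood of the preferred sample relative to the reference, and rises when it drifts toward the rejected one; `beta` controls how strongly deviations from the reference are weighted.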

Maintenance & Community

The project is associated with the CVPR 2025 paper "Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step." The repository has recent updates (March 2025) releasing training code and data for DPO and fine-tuned ORM/PARM.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The code dependencies (mmdetection, LLaVA-NeXT) have their own licenses, which may impact commercial use or closed-source integration.

Limitations & Caveats

This is research code accompanying a CVPR 2025 paper. Although code and checkpoints have been released, it should be treated as experimental. Hardware requirements beyond the multi-GPU training commands are not documented.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 135 stars in the last 90 days
