T2I-R1 by CaraJ7

Text-to-image generation research paper using reinforcement learning

Created 6 months ago
411 stars

Top 71.0% on SourcePulse

Project Summary

T2I-R1 is a novel text-to-image generation model that enhances image quality and prompt alignment through a bi-level Chain-of-Thought (CoT) reasoning process optimized with reinforcement learning. It targets researchers and developers in generative AI seeking to improve the controllability and fidelity of text-to-image synthesis.

How It Works

T2I-R1 employs a dual CoT strategy: a semantic-level CoT that plans the global image structure and objects from the prompt, and a token-level CoT that handles fine-grained, patch-by-patch image generation and local coherence. Both levels are optimized jointly within a single training step using BiCoT-GRPO, an RL approach that scores generations with an ensemble of reward functions (HPS, GIT, GroundingDINO, ORM). This bi-level optimization aims to give more explicit control over the generation process, which in turn is meant to improve image quality and prompt alignment.
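The reward side of this scheme can be illustrated with a minimal Python sketch. It is not the repository's code: it assumes the ensemble reward is an unweighted mean of the individual scores and uses the standard GRPO idea of normalizing each reward against the other samples drawn for the same prompt. The names ensemble_reward and group_relative_advantages, the stub reward functions, and the use of strings in place of images are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# A reward function maps (prompt, image) to a scalar score. Images are
# represented as opaque strings here purely for illustration.
RewardFn = Callable[[str, str], float]


def ensemble_reward(prompt: str, image: str, reward_fns: Sequence[RewardFn]) -> float:
    """Combine several reward functions (assumption: unweighted mean)."""
    return mean([fn(prompt, image) for fn in reward_fns])


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a real implementation may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical stand-ins for reward models such as HPS or GIT.
    def length_reward(prompt: str, image: str) -> float:
        return float(len(image))

    def match_reward(prompt: str, image: str) -> float:
        return 1.0 if prompt in image else 0.0

    group = ["a cat", "a cat on a mat", "dog"]  # one sampled group per prompt
    rewards = [ensemble_reward("cat", img, [length_reward, match_reward]) for img in group]
    print(group_relative_advantages(rewards))
```

Samples that score above their group's mean receive a positive advantage and are reinforced; samples below the mean are discouraged. Because the advantage is relative to the group, no separate value network is needed.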

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n t2i-r1 python=3.10), activate it, install PyTorch/TorchVision per official instructions, and then pip install -r requirements.txt from the src directory.
  • Prerequisites: GroundingDINO and LLaVA-NeXT are required for specific reward models. Reward model checkpoints (HPS, GIT, GroundingDINO, ORM) must be downloaded separately.
  • Setup: Beyond the checkpoint downloads, some reward models may require modifying their dependencies.
  • Links: Official Paper, CVPR 2025 Previous Work
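After following the installation steps, a small sanity check can confirm that the environment matches what the summary describes. The sketch below is only an illustration: it checks for Python 3.10 and for importable torch and torchvision packages, and the helper names missing_modules and check_environment are hypothetical. It does not verify the reward model checkpoints, whose locations are not given here.

```python
import importlib.util
import sys
from typing import Iterable, List

REQUIRED_PYTHON = (3, 10)                    # from the conda create command
REQUIRED_MODULES = ("torch", "torchvision")  # installed per official instructions


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment() -> List[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"expected Python 3.10, found {found}")
    for name in missing_modules(REQUIRED_MODULES):
        problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    if not issues:
        print("environment looks consistent with the setup notes")
```

Run it inside the activated t2i-r1 environment; any printed problem points at the installation step to revisit.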

Highlighted Details

  • Reinforces image generation using collaborative semantic-level and token-level Chain-of-Thought (CoT).
  • Utilizes BiCoT-GRPO with an ensemble of reward functions (HPS, GIT, GroundingDINO, ORM).
  • The codebase is modified to support Zero3 training and bundles the reward model repositories.

Maintenance & Community

The project is associated with the paper "T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT" accepted at CVPR 2025. The repository is the official release for this work.

Licensing & Compatibility

Released under Apache License 2.0. Checkpoints are for research purposes only. Users are permitted to create images but must comply with local laws and use responsibly.

Limitations & Caveats

This is a research release, and its checkpoints are intended for research use only. At the time of this summary, the README listed the ORM checkpoint and reward code as coming soon, with general checkpoints expected within two weeks.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days
