CrossFlow by qihao067

PyTorch text-to-image generation framework

Created 11 months ago

323 stars

Top 84.1% on SourcePulse

Project Summary

This repository provides a PyTorch reimplementation of CrossFlow, a text-to-image generation framework designed for noise-free cross-modality evolution. It targets researchers and practitioners in computer vision and generative AI, and it differs from the original paper in its model architecture, language models, and training datasets.

How It Works

CrossFlow uses a diffusion-transformer backbone and supports both DiT and DiMR. Text prompts are encoded with a language model such as CLIP or T5-XXL, and images are generated by evolving the resulting latent representation toward the image distribution rather than starting from Gaussian noise. Because generation starts from the text latent itself, the latent space supports smooth interpolation and arithmetic operations for creative image manipulation.
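To make the "evolving a latent" idea concrete, the sketch below integrates a velocity field from a source latent (standing in for an encoded prompt) to a target latent using fixed-step Euler updates. It is a toy in plain Python, not the repository's sampler: `euler_sample` and `make_toy_velocity` are hypothetical names, the closed-form velocity replaces a trained DiT or DiMR network, and real latents would be tensors rather than short lists.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(z0: Vector, velocity: VelocityFn, num_steps: int = 50) -> Vector:
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a trained network.

    Returns the velocity of the straight-line path that reaches `target`
    at t=1: (target - z_t) / (1 - t).
    """
    def velocity(z: Vector, t: float) -> Vector:
        return [(ti - zi) / (1.0 - t) for ti, zi in zip(target, z)]
    return velocity


# Assumed toy values: a 3-dimensional "text latent" and "image latent".
text_latent = [0.5, -1.0, 2.0]
image_latent = [1.0, 0.0, -1.0]
result = euler_sample(text_latent, make_toy_velocity(image_latent), num_steps=50)
```

Here `result` ends up numerically equal to `image_latent`, because the toy velocity describes a straight path between the two points. In an actual model the velocity is learned, and the starting point is the encoded prompt instead of a noise sample.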

Quick Start & Requirements

Installation: Clone the repository, then install dependencies by running the following commands in order:
    pip3 install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
    pip3 install -U --pre triton
    pip3 install -r requirements.txt
Prerequisites: PyTorch 2.1.2, CUDA 12.1. Requires downloading Stable Diffusion VAE and reference statistics.
Resources: Pretrained models are available for download. Training requires significant computational resources.
Links: project page, Hugging Face demo, paper, arXiv.
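Before training or sampling, it can help to confirm that the installed packages match the pinned versions. The sketch below does this with the standard library only. `EXPECTED`, `installed_version`, and `check_environment` are hypothetical helpers written for this summary, not part of the repository; the only assumption about the install is that CUDA wheels may report a local suffix such as `2.1.2+cu121`, which is stripped before comparison.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions pinned by the install command listed under Installation.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Map each mismatched or missing package to a short description."""
    problems: Dict[str, str] = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found is None:
            problems[package] = "not installed"
        elif found.split("+")[0] != wanted:  # ignore local suffix like "+cu121"
            problems[package] = f"found {found}, expected {wanted}"
    return problems
```

Calling `check_environment(EXPECTED)` returns an empty dictionary when all three packages match. The `lookup` parameter exists so the check can be exercised without the packages installed.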

Highlighted Details

Supports both DiT and DiMR architectures.
Offers T5-XXL language model integration alongside CLIP.
Trained on open-source datasets (LAION-400M, JourneyDB) instead of proprietary data.
Enables latent space interpolation and arithmetic operations for image manipulation.
Provides pre-trained checkpoints for 256x256 and 512x512 resolutions.
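The interpolation and arithmetic operations mentioned above reduce to simple vector operations on latents. The following is a minimal sketch in plain Python, assuming latents can be treated as flat vectors; `lerp`, `interpolation_path`, and `latent_arithmetic` are hypothetical names, and the repository's own scripts may organize this differently. Each resulting latent would then be passed to the sampler to produce an image.

```python
from typing import List

Vector = List[float]


def lerp(a: Vector, b: Vector, alpha: float) -> Vector:
    """Linear interpolation: alpha=0 returns a, alpha=1 returns b."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def interpolation_path(a: Vector, b: Vector, num_points: int) -> List[Vector]:
    """Evenly spaced latents from a to b, endpoints included."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    return [lerp(a, b, i / (num_points - 1)) for i in range(num_points)]


def latent_arithmetic(base: Vector, minus: Vector, plus: Vector) -> Vector:
    """Compute base - minus + plus element-wise.

    Intended use: remove one concept from a prompt latent and add another.
    """
    return [x - m + p for x, m, p in zip(base, minus, plus)]


# Assumed toy 2-dimensional latents for two prompts.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], num_points=3)
edited = latent_arithmetic([1.0, 2.0], [0.5, 0.5], [0.0, 1.0])
```

With these toy inputs, `path` is `[[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]` and `edited` is `[0.5, 2.5]`.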

Maintenance & Community

The project is associated with CVPR 2025 and lists Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, and Mannat Singh as contributors. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The project is created for research purposes. The specific license is not stated, but the research-focused nature may imply restrictions on commercial use.

Limitations & Caveats

T5-XXL models fine-tuned on JourneyDB may exhibit minor text-image misalignment compared to models trained from scratch. Linear interpolation sampling is currently limited to a single GPU.
