OneDiffusion by lehduong

Versatile diffusion model for bidirectional image synthesis and understanding (CVPR 2025 paper)

created 8 months ago
648 stars

Top 52.4% on sourcepulse

Project Summary

OneDiffusion is a versatile diffusion model designed for large-scale, bidirectional image synthesis and understanding across diverse tasks. It targets researchers and practitioners in computer vision who need a unified framework for tasks like text-to-image generation, image editing, and multiview synthesis, offering a single model capable of handling multiple modalities and operations.

How It Works

OneDiffusion leverages a unified diffusion architecture that supports various conditional inputs and outputs. It employs a flexible prompt-based interface, allowing users to specify tasks and conditions using natural language and image inputs. The model's strength lies in its ability to perform zero-shot task combinations by integrating different task tokens and conditioning information, enabling novel applications without task-specific fine-tuning.
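A hypothetical sketch of the task-token idea described above. The token strings and the helper function here are illustrative placeholders, not the repository's exact interface; consult the official README for the real prompt format.

```python
# Illustrative sketch of a prompt-based task interface: a natural-language
# description is prefixed with a task token so one model can route between
# tasks. Token names ("text2image", etc.) are assumptions, not the repo's
# exact strings.

def build_prompt(task: str, description: str) -> str:
    """Prefix a natural-language description with a task token."""
    return f"[[{task}]] {description}"

prompt = build_prompt("text2image", "a watercolor fox in a snowy forest")
print(prompt)  # [[text2image]] a watercolor fox in a snowy forest
```

Zero-shot combinations would then amount to composing several such tokens and conditions in one prompt, which is why no task-specific fine-tuning is needed for novel pairings.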

Quick Start & Requirements

  • Installation: Use Conda to create an environment and install dependencies:
    conda create -n onediffusion_env python=3.8
    conda activate onediffusion_env
    pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
    pip install "git+https://github.com/facebookresearch/pytorch3d.git"
    pip install -r requirements.txt
    
  • Prerequisites: CUDA 11.8, Python 3.8.
  • Demo: Requires a GPU with at least 21 GB VRAM with the Molmo captioner, 27 GB with LLaVA, or 12 GB with manual captioning.
  • Resources: An official Hugging Face Space is available for a hosted demo.
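The VRAM thresholds above can be checked locally before launching the demo. This is a minimal sketch, assuming the documented 21/27/12 GB figures; the helper name and threshold table are ours, not part of the project.

```python
# Minimal sketch: check whether a GPU's memory meets the demo's documented
# VRAM needs (21 GB Molmo, 27 GB LLaVA, 12 GB manual captioning).
# Thresholds come from the project docs; the helper itself is illustrative.

VRAM_GB_REQUIRED = {"molmo": 21, "llava": 27, "manual": 12}

def fits_in_vram(total_gb: float, captioner: str = "molmo") -> bool:
    """Return True if `total_gb` of GPU memory suffices for `captioner`."""
    return total_gb >= VRAM_GB_REQUIRED[captioner]

if __name__ == "__main__":
    try:
        import torch  # optional: query the actual device if torch is present
        if torch.cuda.is_available():
            total = torch.cuda.get_device_properties(0).total_memory / 1024**3
            for mode in VRAM_GB_REQUIRED:
                status = "ok" if fits_in_vram(total, mode) else "insufficient"
                print(f"{mode}: {status}")
    except ImportError:
        pass
```

For example, a 24 GB card clears the Molmo and manual-captioning thresholds but not the LLaVA one.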

Highlighted Details

  • Supports text-to-image, ID customization, multiview generation, condition-to-image, subject-driven generation, and text-guided image editing.
  • Achieves subject-driven generation after fine-tuning on Subject-200K and OmniEdit datasets.
  • Demonstrates zero-shot task combinations, though robustness may vary.

Maintenance & Community

  • Official repository for the CVPR 2025 paper "One Diffusion to Generate Them All".
  • A Hugging Face Space has been released for the demo.

Licensing & Compatibility

  • Model weights are released under a CC BY-NC license due to training on non-commercially licensed datasets.
  • Not suitable for commercial use.

Limitations & Caveats

Performance on zero-shot task combinations may not be robust, and prompt/caption order can affect behavior. Fine-tuning is recommended for combined tasks to improve performance and simplify usage.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 90 days
