OneDiffusion by lehduong

Versatile diffusion model for bidirectional image synthesis and understanding (CVPR 2025 paper)

created 8 months ago
648 stars

Top 52.4% on sourcepulse

Project Summary

OneDiffusion is a versatile diffusion model designed for large-scale, bidirectional image synthesis and understanding across diverse tasks. It targets researchers and practitioners in computer vision who need a unified framework for tasks like text-to-image generation, image editing, and multiview synthesis, offering a single model capable of handling multiple modalities and operations.

How It Works

OneDiffusion leverages a unified diffusion architecture that supports various conditional inputs and outputs. It employs a flexible prompt-based interface, allowing users to specify tasks and conditions using natural language and image inputs. The model's strength lies in its ability to perform zero-shot task combinations by integrating different task tokens and conditioning information, enabling novel applications without task-specific fine-tuning.
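A hypothetical sketch of the task-token idea described above. The token strings and the helper function here are illustrative placeholders, not the repository's exact interface; consult the official README for the real prompt format.

```python
# Illustrative sketch of a prompt-based task interface: a natural-language
# description is prefixed with a task token so one model can route between
# tasks. Token names ("text2image", etc.) are assumptions, not the repo's
# exact strings.

def build_prompt(task: str, description: str) -> str:
    """Prefix a natural-language description with a task token."""
    return f"[[{task}]] {description}"

prompt = build_prompt("text2image", "a watercolor fox in a snowy forest")
print(prompt)  # [[text2image]] a watercolor fox in a snowy forest
```

Zero-shot combinations would then amount to composing several such tokens and conditions in one prompt, which is why no task-specific fine-tuning is needed for novel pairings.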

Quick Start & Requirements

  • Installation: Use Conda to create an environment and install dependencies:
    conda create -n onediffusion_env python=3.8
    conda activate onediffusion_env
    pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
    pip install "git+https://github.com/facebookresearch/pytorch3d.git"
    pip install -r requirements.txt
    
  • Prerequisites: CUDA 11.8, Python 3.8.
  • Demo: Requires a GPU with at least 21 GB VRAM with the Molmo captioner, 27 GB with LLaVA, or 12 GB with manual captioning.
  • Resources: An official Hugging Face Space is available for a hosted demo.
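The VRAM thresholds above can be checked locally before launching the demo. This is a minimal sketch, assuming the documented 21/27/12 GB figures; the helper name and threshold table are ours, not part of the project.

```python
# Minimal sketch: check whether a GPU's memory meets the demo's documented
# VRAM needs (21 GB Molmo, 27 GB LLaVA, 12 GB manual captioning).
# Thresholds come from the project docs; the helper itself is illustrative.

VRAM_GB_REQUIRED = {"molmo": 21, "llava": 27, "manual": 12}

def fits_in_vram(total_gb: float, captioner: str = "molmo") -> bool:
    """Return True if `total_gb` of GPU memory suffices for `captioner`."""
    return total_gb >= VRAM_GB_REQUIRED[captioner]

if __name__ == "__main__":
    try:
        import torch  # optional: query the actual device if torch is present
        if torch.cuda.is_available():
            total = torch.cuda.get_device_properties(0).total_memory / 1024**3
            for mode in VRAM_GB_REQUIRED:
                status = "ok" if fits_in_vram(total, mode) else "insufficient"
                print(f"{mode}: {status}")
    except ImportError:
        pass
```

For example, a 24 GB card clears the Molmo and manual-captioning thresholds but not the LLaVA one.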

Highlighted Details

  • Supports text-to-image, ID customization, multiview generation, condition-to-image, subject-driven generation, and text-guided image editing.
  • Achieves subject-driven generation after fine-tuning on Subject-200K and OmniEdit datasets.
  • Demonstrates zero-shot task combinations, though robustness may vary.

Maintenance & Community

  • Official repository for the CVPR 2025 paper "One Diffusion to Generate Them All".
  • A Hugging Face Space has been released for the demo.

Licensing & Compatibility

  • Model weights are released under a CC BY-NC license due to training on non-commercially licensed datasets.
  • Not suitable for commercial use.

Limitations & Caveats

Performance on zero-shot task combinations may not be robust, and prompt/caption order can affect behavior. Fine-tuning is recommended for combined tasks to improve performance and simplify usage.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 90 days
