diffusiondb by poloclub

Text-to-image prompt gallery dataset for generative AI research

created 2 years ago
1,291 stars

Top 31.5% on sourcepulse

Project Summary

DiffusionDB is a large-scale dataset for text-to-image generation research. It offers 14 million images generated by Stable Diffusion, each paired with the prompt and hyperparameters used to create it. It targets researchers and developers working on generative models, deepfake detection, and human-AI interaction. Because the prompts were written by real users, the dataset is a resource for studying prompt engineering and model behavior.

How It Works

The dataset comes in two subsets: DiffusionDB 2M (2 million images, 1.6 TB) and DiffusionDB Large (14 million images, 6.5 TB). Images are split across modular folders. Each folder contains a batch of images and one JSON file that maps every image filename to its generation parameters (prompt, seed, CFG scale, steps, sampler). The same metadata is also provided in Parquet format, so it can be queried efficiently without downloading any images.
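
The per-folder JSON mapping described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own loader: the short key names (`p`, `se`, `c`, `st`, `sa`), the sample filename, and the sample values are assumptions made for illustration, so check an actual folder's JSON file before relying on them.

```python
import json
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Generation parameters for one image (field names chosen for this sketch)."""
    prompt: str
    seed: int
    cfg_scale: float
    steps: int
    sampler: str


def parse_part_json(text):
    """Parse one folder's JSON text into {image filename: GenerationParams}.

    Assumes each entry uses the short keys p, se, c, st, sa (an assumption).
    """
    raw = json.loads(text)
    return {
        filename: GenerationParams(
            prompt=entry["p"],
            seed=entry["se"],
            cfg_scale=entry["c"],
            steps=entry["st"],
            sampler=entry["sa"],
        )
        for filename, entry in raw.items()
    }


# Hypothetical sample standing in for a real folder's JSON file.
SAMPLE_JSON = """
{
  "example-image.png": {
    "p": "a watercolor painting of a fox",
    "se": 12345,
    "c": 7.0,
    "st": 50,
    "sa": "k_lms"
  }
}
"""

params = parse_part_json(SAMPLE_JSON)
print(params["example-image.png"].prompt)
```

Keeping the result keyed by filename mirrors the layout the summary describes, so an image on disk can be matched to its parameters with a single dictionary lookup.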

Quick Start & Requirements

  • Loading: Via Hugging Face Datasets (pip install datasets Pillow) or a provided Python downloader script (download.py).
  • Prerequisites: Python, Hugging Face Datasets library, Pillow.
  • Storage: DiffusionDB 2M requires 1.6 TB; DiffusionDB Large requires 6.5 TB.
  • Metadata only: Loading the metadata tables without the images needs only pandas and Python's urlretrieve (from urllib.request).
  • Documentation: Dataset Preview, example-loading.ipynb
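
The metadata-only route mentioned in the list above might look like the sketch below. The URL is an assumption (the metadata table in the project's Hugging Face repository) and should be confirmed against the project README; the function names are hypothetical. `pandas.read_parquet` also needs a Parquet engine such as pyarrow installed.

```python
from urllib.request import urlretrieve

# Assumed location of the metadata table; confirm it in the project README.
METADATA_URL = (
    "https://huggingface.co/datasets/poloclub/diffusiondb/"
    "resolve/main/metadata.parquet"
)


def download_metadata(dest="metadata.parquet", url=METADATA_URL):
    """Download the Parquet metadata table to `dest` and return the local path."""
    path, _headers = urlretrieve(url, dest)
    return path


def load_metadata(path="metadata.parquet"):
    """Read a downloaded metadata table into a pandas DataFrame."""
    # Imported lazily so the download step works without pandas installed.
    import pandas as pd

    return pd.read_parquet(path)


# Example usage (performs a network download, so it is left commented out):
# df = load_metadata(download_metadata())
# print(df.columns.tolist())
```

Splitting download and load into two functions lets the table be fetched once and re-read many times without touching the network again.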

Highlighted Details

  • Contains 14 million images and 1.8 million unique prompts.
  • Metadata tables (Parquet) allow efficient access to prompts and parameters without downloading images.
  • Includes NSFW likelihood scores for both images and prompts.
  • Data collected from the official Stable Diffusion Discord server.

Maintenance & Community

  • Created by Jay Wang, Evan Montoya, David Munechika, Alex Yang, Ben Hoover, and Polo Chau.
  • The dataset is updated periodically in response to user reports submitted through a Google Form.
  • Contact: Open an issue or contact Jay Wang.

Licensing & Compatibility

  • Dataset: CC0 1.0 License (Public Domain).
  • Code: MIT License.
  • Both licenses are permissive, so the dataset and code can be used commercially and linked from closed-source projects.

Limitations & Caveats

The dataset may contain some NSFW images despite an NSFW filter, and users are advised to apply their own filtering based on provided NSFW scores. Timestamps may not be accurate for duplicate images.
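
Filtering on the provided NSFW scores could be sketched as follows. This is an illustration under stated assumptions: the field names `image_nsfw` and `prompt_nsfw`, the 0.5 thresholds, and the sample records are all hypothetical, and suitable thresholds depend on the application.

```python
def filter_by_nsfw(records, max_image_nsfw=0.5, max_prompt_nsfw=0.5):
    """Keep records whose image and prompt NSFW scores are both at or below the thresholds.

    `records` is an iterable of dicts with assumed keys `image_nsfw` and `prompt_nsfw`.
    """
    return [
        record
        for record in records
        if record["image_nsfw"] <= max_image_nsfw
        and record["prompt_nsfw"] <= max_prompt_nsfw
    ]


# Hypothetical metadata rows, for illustration only.
sample_records = [
    {"image_name": "a.png", "image_nsfw": 0.05, "prompt_nsfw": 0.01},
    {"image_name": "b.png", "image_nsfw": 0.90, "prompt_nsfw": 0.02},
    {"image_name": "c.png", "image_nsfw": 0.10, "prompt_nsfw": 0.75},
]

safe_records = filter_by_nsfw(sample_records)
print([record["image_name"] for record in safe_records])
```

Requiring both scores to pass is the stricter choice; a record is dropped if either the image or the prompt is flagged.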

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days
