diffusiondb by poloclub

Text-to-image prompt gallery dataset for generative AI research

Created 3 years ago

1,356 stars

Top 29.2% on SourcePulse

2 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Elvis Saravia

Founder of DAIR.AI

Project Summary

DiffusionDB is a large-scale dataset for text-to-image generation research, offering 14 million images generated by Stable Diffusion, each paired with the prompt and hyperparameters that produced it. It targets researchers and developers working on generative models, deepfake detection, and human-AI interaction, and serves as a human-actuated resource for studying prompt engineering and model behavior.

How It Works

The dataset comes in two subsets: DiffusionDB 2M (2 million images, 1.6 TB) and DiffusionDB Large (14 million images, 6.5 TB). Images are split into modular folders; each folder holds images plus a JSON file that maps each image filename to its generation parameters (prompt, seed, CFG scale, steps, sampler). Metadata is also provided as Parquet tables, so prompts and parameters can be queried without downloading any images.
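
To make the per-folder layout concrete, here is a minimal sketch of turning one folder's JSON file into typed records. The short keys (p, se, c, st, sa), the sample entry, and the helper names parse_part_index and load_part_index are assumptions for illustration, not part of the dataset's tooling; inspect a real folder's JSON file before relying on them.

```python
import json
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """One image and the parameters used to generate it."""
    image_name: str
    prompt: str
    seed: int
    cfg_scale: float
    steps: int
    sampler: str


def parse_part_index(raw: dict) -> list:
    """Convert a {filename: parameters} mapping into a list of records.

    Assumes the abbreviated keys p, se, c, st, sa (hypothetical here).
    """
    records = []
    for image_name, params in raw.items():
        records.append(
            GenerationParams(
                image_name=image_name,
                prompt=params["p"],
                seed=int(params["se"]),
                cfg_scale=float(params["c"]),
                steps=int(params["st"]),
                sampler=params["sa"],
            )
        )
    return records


def load_part_index(json_path) -> list:
    """Read one folder's JSON file from disk and parse it."""
    with open(json_path, encoding="utf-8") as f:
        return parse_part_index(json.load(f))


# Made-up sample entry, standing in for the contents of a folder's JSON file.
SAMPLE_INDEX = {
    "example-image.png": {
        "p": "a watercolor fox",
        "se": 1234,
        "c": 7.0,
        "st": 50,
        "sa": "k_lms",
    }
}

records = parse_part_index(SAMPLE_INDEX)
print(records[0].prompt, records[0].seed)
```

Keeping the parsing separate from the file read means the same function can be pointed at a downloaded folder or at an in-memory sample like the one above.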

Quick Start & Requirements

Loading: Via Hugging Face Datasets (pip install datasets Pillow) or the provided Python downloader script (download.py).
Prerequisites: Python, Hugging Face Datasets library, Pillow.
Storage: DiffusionDB 2M requires 1.6TB, DiffusionDB Large requires 6.5TB.
Resources: Loading only the metadata requires just pandas and urlretrieve (from Python's urllib.request).
Documentation: Dataset Preview, example-loading.ipynb

Highlighted Details

Contains 14 million images and 1.8 million unique prompts.
Metadata tables (Parquet) allow efficient access to prompts and parameters without downloading images.
Includes NSFW likelihood scores for both images and prompts.
Data collected from the official Stable Diffusion Discord server.

Maintenance & Community

Created by Jay Wang, Evan Montoya, David Munechika, Alex Yang, Ben Hoover, and Polo Chau.
The data is updated periodically in response to user reports submitted via a Google Form.
Contact: Open an issue or contact Jay Wang.

Licensing & Compatibility

Dataset: CC0 1.0 License (Public Domain).
Code: MIT License.
Both licenses are permissive, so the dataset and code are compatible with commercial use and closed-source linking.

Limitations & Caveats

The dataset may still contain some NSFW images despite an NSFW filter; users are advised to apply their own filtering using the provided NSFW scores. Timestamps may be inaccurate for duplicate images.
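
The filtering advice above could look like the following minimal sketch. It assumes each metadata row exposes an image_nsfw and a prompt_nsfw score where higher means more likely NSFW; those field names, the sample rows, the filter_sfw helper, and the 0.2 thresholds are illustrative assumptions, not values recommended by the dataset authors.

```python
def filter_sfw(rows, image_threshold=0.2, prompt_threshold=0.2):
    """Keep rows whose image and prompt NSFW scores are both at or below
    the given thresholds.

    Rows missing either score are dropped, erring on the side of caution.
    """
    kept = []
    for row in rows:
        image_score = row.get("image_nsfw")
        prompt_score = row.get("prompt_nsfw")
        if image_score is None or prompt_score is None:
            continue  # no score available, so the row cannot be vetted
        if image_score <= image_threshold and prompt_score <= prompt_threshold:
            kept.append(row)
    return kept


# Made-up rows standing in for entries from the metadata table.
sample_rows = [
    {"image_name": "a.png", "image_nsfw": 0.05, "prompt_nsfw": 0.01},
    {"image_name": "b.png", "image_nsfw": 0.90, "prompt_nsfw": 0.02},
    {"image_name": "c.png", "image_nsfw": 0.10, "prompt_nsfw": 0.75},
    {"image_name": "d.png", "prompt_nsfw": 0.01},  # missing image score
]

safe = filter_sfw(sample_rows)
print([row["image_name"] for row in safe])  # ['a.png']
```

Requiring both scores to pass is the stricter choice; a looser policy could check only the image score, at the cost of keeping images whose prompts were flagged.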

Health Check

Last Commit

1 year ago

Responsiveness

1 day

Star History

6 stars in the last 30 days