coyo-dataset by kakaobrain

Image-text pair dataset for vision-language model training

Created 3 years ago

1,248 stars

Top 31.5% on SourcePulse


6 Experts Love This Project

Jesse Clark

Cofounder of Marqo

Elie Bursztein

Cybersecurity Lead at Google DeepMind

Ross Wightman

Author of timm; CV at Hugging Face

Andreas Blattmann

Cofounder of Black Forest Labs

and 2 more!

Project Summary

COYO-700M is a dataset of 747 million image-text pairs for training large-scale vision-language foundation models. It is intended to complement existing datasets with a high-quality, diverse collection of pairs derived from Common Crawl, filtered at both the image and the text level.

How It Works

The dataset was built from approximately 10 billion image-text pairs collected from Common Crawl and passed through multi-stage filtering. Image-level filtering removed small, low-resolution, or NSFW images and de-duplicated images by perceptual hash. Text-level filtering removed texts that were short, uninformative, repetitive, or profane. Finally, samples were de-duplicated on the combination of image perceptual hash and text content. The filtering is meant to keep the data clean enough to be usable for training advanced models.
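The stages described above can be sketched roughly as follows. This is a minimal illustration, not Kakao Brain's pipeline: the `Sample` type, the thresholds, and the simple average hash standing in for the perceptual hash are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# All thresholds are illustrative assumptions, not COYO's actual values.
MIN_SIDE = 200    # hypothetical minimum width/height in pixels
MAX_NSFW = 0.5    # hypothetical NSFW score cutoff
MIN_WORDS = 3     # hypothetical minimum caption length in words


@dataclass
class Sample:
    """Hypothetical record for one crawled image-text pair."""
    url: str
    text: str
    width: int
    height: int
    nsfw_score: float
    pixels: List[List[int]]  # tiny grayscale thumbnail, values 0-255


def average_hash(pixels: List[List[int]]) -> int:
    """Simple average hash: one bit per pixel, set when the pixel exceeds the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def keep_image(s: Sample) -> bool:
    """Image-level filter: drop small or likely-NSFW images."""
    return min(s.width, s.height) >= MIN_SIDE and s.nsfw_score < MAX_NSFW


def keep_text(s: Sample) -> bool:
    """Text-level filter: drop very short captions."""
    return len(s.text.split()) >= MIN_WORDS


def filter_pairs(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Apply both filters, then de-duplicate on (image hash, normalized text)."""
    seen = set()
    for s in samples:
        if not (keep_image(s) and keep_text(s)):
            continue
        key = (average_hash(s.pixels), s.text.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        yield s
```

Keying the final de-duplication on the pair rather than the image alone means the same image with a different caption survives as a separate sample.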

Quick Start & Requirements

Download: The dataset is available through Hugging Face Datasets. See download/README.md in the repository for details on downloading the images.
Prerequisites: PyTorch or TensorFlow for the usage examples.
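As a rough sketch of what access through Hugging Face Datasets might look like, the snippet below streams the metadata instead of downloading it in full. The repository id `kakaobrain/coyo-700m` and the `url` and `text` column names are assumptions here and should be checked against the dataset card; the images themselves are not part of this stream and have to be fetched separately.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple


def stream_coyo(split: str = "train"):
    """Open the metadata as a stream (needs `pip install datasets` and network access)."""
    # Imported lazily so iter_pairs() below stays usable without the library.
    from datasets import load_dataset
    # Assumed repository id; verify it on the Hugging Face Hub.
    return load_dataset("kakaobrain/coyo-700m", split=split, streaming=True)


def iter_pairs(records: Iterable[Dict], limit: int) -> Iterator[Tuple[str, str]]:
    """Yield (image URL, caption) for the first `limit` records."""
    for rec in islice(records, limit):
        # "url" and "text" are assumed column names.
        yield rec["url"], rec["text"]


# Example use (requires network access):
# for url, text in iter_pairs(stream_coyo(), 5):
#     print(url, text)
```

Streaming avoids materializing hundreds of millions of rows locally when only a sample is needed.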

Highlighted Details

Contains 747M image-text pairs with extensive metadata including CLIP similarities, NSFW scores, and aesthetic scores.
A COYO-Labeled-300M subset is available with machine-generated vision labels for 300M unique images across 21,841 classes.
Validated empirically by training re-implementations of ALIGN, unCLIP, and ViT on the data, with competitive results.
Includes pre-training and fine-tuning code for ViT with weight files.
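The per-sample metadata makes it possible to tighten the filtering further before training. The sketch below selects records by CLIP similarity, NSFW score, and aesthetic score. The default column names and every cutoff are assumptions chosen for illustration; confirm the actual column names in the dataset card and tune the thresholds for the task.

```python
from typing import Dict, Iterable, Iterator


def select_records(
    records: Iterable[Dict],
    min_clip: float = 0.3,        # hypothetical cutoff
    max_nsfw: float = 0.1,        # hypothetical cutoff
    min_aesthetic: float = 5.0,   # hypothetical cutoff
    clip_col: str = "clip_similarity_vitb32",         # assumed column name
    nsfw_col: str = "nsfw_score_opennsfw2",           # assumed column name
    aesthetic_col: str = "aesthetic_score_laion_v2",  # assumed column name
) -> Iterator[Dict]:
    """Yield only records whose metadata scores pass all three cutoffs."""
    for rec in records:
        if rec[clip_col] < min_clip:
            continue  # image and caption are weakly related
        if rec[nsfw_col] > max_nsfw:
            continue  # likely inappropriate content
        if rec[aesthetic_col] < min_aesthetic:
            continue  # low predicted aesthetic quality
        yield rec
```

Because the function takes any iterable of dictionaries, it can be applied to a streamed dataset or to rows loaded from local metadata files.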

Maintenance & Community

Developed by Kakao Brain's Large-Scale AI Studio.
Contact: coyo@kakaobrain.com for cooperation.

Licensing & Compatibility

Licensed under CC-BY-4.0, which permits commercial use but requires attribution.
The underlying images and text remain subject to their original licenses.

Limitations & Caveats

Because of its scale, the dataset was not human-curated and, despite extensive filtering, may contain discomforting or inappropriate content. Users are solely responsible for any issues arising from its use, and Kakao Brain strongly recommends against using it directly in commercial products without additional processing to remove inappropriate data.

Health Check

Last Commit: 3 years ago
Responsiveness: 1 week

Star History

4 stars in the last 30 days