coyo-dataset  by kakaobrain

Image-text pair dataset for vision-language model training

Created 3 years ago
1,236 stars

Top 31.9% on SourcePulse

Project Summary

COYO-700M is a dataset of 747 million image-text pairs designed for training large-scale vision-language foundation models. It aims to complement existing datasets with a diverse collection of image-text pairs derived from Common Crawl, filtered at both the image and text levels.

How It Works

The dataset was built from approximately 10 billion image-text pairs collected from Common Crawl and reduced through multi-stage filtering. Image-level filtering removed small, low-resolution, or NSFW images and de-duplicated images by perceptual hash. Text-level filtering removed short, uninformative, or repetitive texts, as well as texts containing profanity. Finally, samples were de-duplicated based on image perceptual hash and text content. The filtering is intended to keep the data usable for training large models.
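The filtering stages described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline Kakao Brain actually ran: the `Pair` record, the thresholds (`MIN_SIDE`, `MAX_NSFW`, `MIN_WORDS`), and the `BLOCKLIST` are all hypothetical, and the perceptual hash is assumed to be precomputed and stored as a string.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical thresholds, chosen only for illustration.
MIN_SIDE = 200       # minimum width/height in pixels
MAX_NSFW = 0.5       # maximum tolerated NSFW score
MIN_WORDS = 3        # minimum caption length in words
BLOCKLIST = {"badword"}  # placeholder profanity list


@dataclass(frozen=True)
class Pair:
    url: str
    text: str
    width: int
    height: int
    nsfw_score: float
    phash: str  # assumed to be a precomputed perceptual hash


def image_ok(p: Pair) -> bool:
    """Image-level filter: drop small or NSFW images."""
    return min(p.width, p.height) >= MIN_SIDE and p.nsfw_score < MAX_NSFW


def text_ok(p: Pair) -> bool:
    """Text-level filter: drop short, repetitive, or profane captions."""
    words = p.text.lower().split()
    if len(words) < MIN_WORDS:
        return False
    if len(set(words)) == 1:  # crude repetition check: one word repeated
        return False
    return not any(w in BLOCKLIST for w in words)


def filter_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Apply both filters, then de-duplicate on (perceptual hash, text)."""
    seen: Set[Tuple[str, str]] = set()
    for p in pairs:
        if not (image_ok(p) and text_ok(p)):
            continue
        key = (p.phash, p.text.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        yield p
```

In this sketch, two samples that share an image hash but carry different captions are both kept, because the de-duplication key combines the hash with the text; whether the original pipeline treats that case the same way is not stated in this summary.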

Quick Start & Requirements

  • Download: The dataset is available via Hugging Face Datasets. See download/README.md for details on downloading the images.
  • Prerequisites: PyTorch or TensorFlow for usage examples.
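
As a rough sketch of the download step, the snippet below streams a few rows with the Hugging Face `datasets` library. The Hub id `kakaobrain/coyo-700m` and the `url` and `text` column names are assumptions not confirmed by this summary, and the rows are assumed to carry image URLs and metadata rather than image bytes; the helper names `stream_coyo` and `take` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def stream_coyo(split: str = "train"):
    """Return a streaming iterator over dataset rows.

    Assumes the `datasets` package is installed and that the dataset is
    published under the Hub id "kakaobrain/coyo-700m" (an assumption).
    """
    # Imported lazily so the helper below works without the package.
    from datasets import load_dataset

    # streaming=True avoids downloading the full dataset up front.
    return load_dataset("kakaobrain/coyo-700m", split=split, streaming=True)


def take(rows: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Collect the first n rows from any row iterator."""
    return list(islice(rows, n))
```

A call such as `take(stream_coyo(), 3)` would then return the first three rows, assuming network access; fetching the images themselves is a separate step covered by download/README.md.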

Highlighted Details

  • Contains 747M image-text pairs with extensive metadata including CLIP similarities, NSFW scores, and aesthetic scores.
  • COYO-Labeled-300M subset is available with machine-generated vision labels for 300M unique images across 21,841 classes.
  • The dataset was validated empirically by re-implementing ALIGN, unCLIP, and ViT models, which achieved competitive performance.
  • Includes pre-training and fine-tuning code for ViT with weight files.
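
The per-sample metadata makes it possible to carve out stricter subsets before training. The sketch below shows one way this might look; the column names (`clip_similarity`, `nsfw_score`, `aesthetic_score`) and the default thresholds are hypothetical placeholders, so check the dataset's actual schema before relying on them.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical column names; the real schema may use different keys.
CLIP_KEY = "clip_similarity"
NSFW_KEY = "nsfw_score"
AESTHETIC_KEY = "aesthetic_score"


def select_rows(
    rows: Iterable[Dict[str, Any]],
    min_clip: float = 0.3,
    max_nsfw: float = 0.1,
    min_aesthetic: float = 5.0,
) -> List[Dict[str, Any]]:
    """Keep rows whose metadata scores pass all three thresholds."""
    kept = []
    for row in rows:
        try:
            clip = float(row[CLIP_KEY])
            nsfw = float(row[NSFW_KEY])
            aesthetic = float(row[AESTHETIC_KEY])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed scores
        if clip >= min_clip and nsfw <= max_nsfw and aesthetic >= min_aesthetic:
            kept.append(row)
    return kept
```

Rows with missing scores are dropped rather than kept here, which errs on the side of caution for a dataset that was not human-curated.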

Maintenance & Community

  • Developed by Kakao Brain's Large-Scale AI Studio.
  • Contact: coyo@kakaobrain.com for cooperation.

Licensing & Compatibility

  • Licensed under CC-BY-4.0.
  • CC-BY-4.0 permits commercial use but requires attribution. The underlying images and text remain subject to their original licenses.

Limitations & Caveats

Because of its scale, the dataset was not human-curated and, despite extensive filtering, may contain discomforting or inappropriate content. Users are solely responsible for any issues arising from its use. Kakao Brain strongly recommends against using it directly in commercial products without additional processing to remove inappropriate data.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
