coyo-dataset  by kakaobrain

Image-text pair dataset for vision-language model training

created 2 years ago
1,232 stars

Top 32.6% on sourcepulse

Project Summary

COYO-700M is a dataset of 747 million image-text pairs designed for training large-scale vision-language foundation models. It aims to complement existing datasets with a diverse collection of image-text pairs derived from Common Crawl and filtered at both the image and text levels.

How It Works

The dataset was built from approximately 10 billion image-text pairs collected from Common Crawl and filtered in three stages. Image-level filtering removed small, low-resolution, or NSFW images and used perceptual hashing to drop duplicate images. Text-level filtering removed short, uninformative, or repetitive texts, as well as texts containing profanity. Finally, samples were de-duplicated on the combination of image perceptual hash and text content. The filtering is intended to keep the data clean enough to train large models on.
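
The three stages described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the thresholds, the field names (`width`, `height`, `nsfw_score`, `image_phash`, `text`), and the tiny blocklist are all assumptions made for the example, and the perceptual hash is assumed to be precomputed.

```python
from typing import Any, Dict, Iterable, Iterator, Set, Tuple

# Hypothetical thresholds and blocklist, chosen only for illustration.
MIN_SIDE = 200          # minimum allowed value of min(width, height)
MAX_NSFW = 0.5          # maximum allowed NSFW score
MIN_WORDS = 3           # minimum number of words in the text
BLOCKLIST = {"badword"}  # placeholder for a real profanity list

Sample = Dict[str, Any]


def keep_image(sample: Sample) -> bool:
    """Image-level filter: drop small or likely-NSFW images."""
    big_enough = min(sample["width"], sample["height"]) >= MIN_SIDE
    return big_enough and sample["nsfw_score"] <= MAX_NSFW


def keep_text(sample: Sample) -> bool:
    """Text-level filter: drop very short texts and texts with blocked words."""
    words = sample["text"].lower().split()
    return len(words) >= MIN_WORDS and not BLOCKLIST.intersection(words)


def filter_pairs(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Apply both filters, then de-duplicate on (image_phash, text)."""
    seen: Set[Tuple[str, str]] = set()
    for sample in samples:
        if not (keep_image(sample) and keep_text(sample)):
            continue
        key = (sample["image_phash"], sample["text"])
        if key in seen:
            continue
        seen.add(key)
        yield sample
```

Because the final key combines the hash and the text, the same image paired with two different captions survives as two samples, while an exact repeat of a pair is dropped.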

Quick Start & Requirements

  • Download: The dataset is available through Hugging Face Datasets. See download/README.md for details on downloading the images.
  • Prerequisites: PyTorch or TensorFlow for usage examples.
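
As a rough sketch of the download step, the snippet below streams a few rows with the Hugging Face `datasets` library instead of fetching everything up front. The dataset identifier `kakaobrain/coyo-700m`, the `train` split, and the `url` and `text` column names are assumptions here; check the dataset card before relying on them. `take_pairs` and `stream_coyo` are hypothetical helper names introduced for this example.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Tuple


def take_pairs(rows: Iterable[Dict[str, Any]], n: int) -> List[Tuple[str, str]]:
    """Return the first n (url, text) pairs from any iterable of row dicts."""
    return [(row["url"], row["text"]) for row in islice(rows, n)]


def stream_coyo(n: int = 5) -> List[Tuple[str, str]]:
    """Stream the first n pairs without downloading the full dataset.

    Requires `pip install datasets` and network access. The dataset ID and
    split name below are assumptions, not confirmed by this page.
    """
    from datasets import load_dataset  # imported lazily so the helper above works offline

    dataset = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)
    return take_pairs(dataset, n)
```

Calling `stream_coyo(5)` would return five `(url, text)` tuples if the assumptions hold. The rows carry image URLs rather than image bytes, so the images themselves still need to be fetched separately.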

Highlighted Details

  • Contains 747M image-text pairs with extensive metadata including CLIP similarities, NSFW scores, and aesthetic scores.
  • COYO-Labeled-300M subset is available with machine-generated vision labels for 300M unique images across 21,841 classes.
  • Validated empirically by training re-implementations of ALIGN, unCLIP, and ViT on the dataset, which achieved competitive performance.
  • Includes pre-training and fine-tuning code for ViT with weight files.
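
The per-sample metadata makes it possible to select a stricter subset before training. The sketch below shows one way to do that over plain row dicts. The column names (`clip_similarity_vitb32`, `nsfw_score_opennsfw2`, `aesthetic_score_laion_v2`) are assumed to match the published schema, and the default thresholds are arbitrary illustrative values, not recommendations from the authors.

```python
from typing import Any, Dict


def passes_quality(
    row: Dict[str, Any],
    min_clip: float = 0.3,       # hypothetical minimum image-text similarity
    max_nsfw: float = 0.2,       # hypothetical maximum NSFW score
    min_aesthetic: float = 5.0,  # hypothetical minimum aesthetic score
) -> bool:
    """Return True if a row clears all three metadata thresholds."""
    return (
        row["clip_similarity_vitb32"] >= min_clip
        and row["nsfw_score_opennsfw2"] <= max_nsfw
        and row["aesthetic_score_laion_v2"] >= min_aesthetic
    )
```

A predicate of this shape can be applied in a list comprehension or passed to a dataset's `filter` method, which keeps the selection logic separate from how the rows are loaded.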

Maintenance & Community

  • Developed by Kakao Brain's Large-Scale AI Studio.
  • Contact: coyo@kakaobrain.com for cooperation.

Licensing & Compatibility

  • Licensed under CC-BY-4.0.
  • CC-BY-4.0 permits commercial use but requires attribution. The underlying images and text remain subject to their original licenses.

Limitations & Caveats

Because of its scale, the dataset was not human-curated and, despite extensive filtering, may contain discomforting or inappropriate content. Users are solely responsible for any issues arising from its use, and Kakao Brain strongly recommends against using it directly in commercial products without additional processing to remove inappropriate data.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days
