Image-text pair dataset for vision-language model training
COYO-700M is a large-scale dataset of 747 million image-text pairs designed for training vision-language foundation models. It aims to complement existing datasets by providing a high-quality, diverse collection of image-text pairs derived from Common Crawl, with extensive filtering applied at both the image and text levels.
How It Works
The dataset was constructed by processing approximately 10 billion image-text pairs from Common Crawl through a multi-stage filtering pipeline. Image-level filtering removed small, low-resolution, and NSFW images; text-level filtering removed short, uninformative, or repetitive texts and those containing profanity. Finally, samples were de-duplicated based on image perceptual hash and text content. This rigorous filtering aims to ensure data quality and usability for training advanced models.
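The two filtering stages above can be sketched in miniature. This is an illustrative toy, not the actual COYO pipeline: the real pipeline computes perceptual hashes (pHash) over decoded images, whereas here a simple average hash over an 8x8 grayscale grid stands in, and the thresholds are invented for demonstration.

```python
def average_hash(pixels):
    """Toy perceptual hash over a flat list of grayscale values (0-255):
    each bit records whether that pixel is above the mean brightness.
    (Stand-in for the pHash used in the real pipeline.)"""
    avg = sum(pixels) / len(pixels)
    return int("".join("1" if p > avg else "0" for p in pixels), 2)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedup_by_hash(hashes, max_dist=4):
    """Image-level de-duplication: keep a hash only if it differs from
    every kept hash by more than max_dist bits (threshold is illustrative)."""
    kept = []
    for h in hashes:
        if all(hamming(h, k) > max_dist for k in kept):
            kept.append(h)
    return kept

def keep_text(text, min_words=3):
    """Text-level filter: drop short or highly repetitive captions.
    A caption counts as repetitive if fewer than half its words are unique."""
    words = text.split()
    if len(words) < min_words:
        return False
    return len(set(words)) / len(words) >= 0.5

# Example: a flat 8x8 "image" and a near-identical copy collapse to one entry.
img_a = [100] * 32 + [200] * 32
img_b = [100] * 32 + [198] * 32
print(len(dedup_by_hash([average_hash(img_a), average_hash(img_b)])))  # 1
print(keep_text("a a a a a a"))        # False: repetitive
print(keep_text("a dog on the beach"))  # True
```

The same keep/drop structure generalizes: each stage is a cheap predicate or hash comparison applied per sample, which is what makes filtering billions of pairs tractable.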
Quick Start & Requirements
See download/README.md for image dataset download details.
Highlighted Details
Maintenance & Community
Contact coyo@kakaobrain.com for cooperation.
Licensing & Compatibility
Limitations & Caveats
Despite extensive filtering, the dataset's large scale means it was not human-curated and may contain discomforting or inappropriate content. Users are solely responsible for any issues arising from its use, and Kakao Brain strongly recommends against using it directly in commercial products without additional processing to remove inappropriate data.