datacomp by mlfoundations

Competition tooling for multimodal dataset curation to pre-train CLIP models

Created 2 years ago
743 stars

Top 46.7% on SourcePulse

Project Summary

DataComp is a competition and tooling repository for designing and evaluating multimodal datasets used to pre-train CLIP models. It targets researchers and engineers who want to advance multimodal understanding by focusing on dataset curation rather than model architecture: the training recipe is held fixed, so gains must come from better data. The result is a structured framework for discovering image-text datasets that improve downstream task performance.

How It Works

Participants curate image-text datasets and train CLIP models on them under a fixed architecture and training recipe, so the data is the only variable. There are two tracks: "filtering" (selecting subsets of provided data pools) and "Bring Your Own Data" (BYOD, sourcing data from elsewhere). Both tracks offer multiple compute scales (small through xlarge) to accommodate diverse resources. Data is distributed as webdataset shards, and participants specify their subsets as NumPy arrays of unique sample identifiers (UIDs).
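In the filtering track, a submission ultimately reduces to a NumPy file of selected UIDs. Below is a minimal sketch, assuming the format described in the README (each UID a 128-bit hex string, stored as a structured array of two uint64 halves); the exact dtype, sort order, and file layout should be verified against the repository:

    import numpy as np

    # Hypothetical UIDs of the samples kept after filtering.
    selected_uids = [
        "0123456789abcdef0123456789abcdef",
        "fedcba9876543210fedcba9876543210",
    ]

    def uid_to_pair(uid_hex):
        # Split the 128-bit hex string into high and low 64-bit integers.
        return (int(uid_hex[:16], 16), int(uid_hex[16:], 16))

    subset = np.array(
        [uid_to_pair(u) for u in selected_uids],
        dtype=np.dtype("u8,u8"),
    )
    subset.sort()  # the resharding tooling reportedly expects sorted UIDs
    np.save("my_subset.npy", subset)

The resulting .npy file is then handed to the repository's tooling to materialize the subset for training.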

Quick Start & Requirements

  • Install dependencies: bash create_env.sh then conda activate datacomp.
  • Additional dependencies for cloud storage: pip install 'cloudpathlib[s3]' (see the sketch after this list).
  • Download data pools: python download_upstream.py --scale $scale --data_dir $data_dir.
  • Data scales range from 12.8M (small) to 12.8B (xlarge) examples.
  • Full dataset download for xlarge scale requires ~450 TB.
  • Official documentation and paper are linked in the README.
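For teams that keep the pool in object storage, cloudpathlib exposes bucket contents through a pathlib-style interface. A minimal sketch of checking downloaded shards, assuming a hypothetical s3://my-bucket/datacomp/small prefix and AWS credentials configured in the environment:

    from cloudpathlib import CloudPath

    # Hypothetical location; substitute the actual $data_dir used above.
    data_dir = CloudPath("s3://my-bucket/datacomp/small")

    # Webdataset shards are .tar files; count what is already present.
    shards = list(data_dir.glob("*.tar"))
    print(f"{len(shards)} shards found under {data_dir}")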

Highlighted Details

  • Supports multiple dataset scales from 12.8M to 12.8B image-text pairs.
  • Provides baseline filtering strategies including CLIP score thresholding, LAION-2B-style filtering, and image-based clustering (a CLIP-score sketch follows this list).
  • Training uses torchrun with fixed hyperparameters per scale; the model grows with scale (ViT-B/32, B/16, L/14).
  • Evaluation and submission are integrated via Hugging Face Hub.
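The CLIP score baseline keeps image-text pairs whose embeddings agree under a pretrained CLIP model; the repository's own baselines work from the provided pool metadata rather than re-encoding every image, but the underlying idea can be sketched with open_clip. The model name, pretrained tag, and the 0.3 threshold here are illustrative, not the competition's settings:

    import torch
    import open_clip
    from PIL import Image

    # Load a pretrained CLIP model and its preprocessing transforms.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def clip_score(image_path, caption):
        # Cosine similarity between image and caption embeddings.
        image = preprocess(Image.open(image_path)).unsqueeze(0)
        text = tokenizer([caption])
        with torch.no_grad():
            img = model.encode_image(image)
            txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()

    # Keep a pair only if its score clears an (arbitrary) threshold.
    keep = clip_score("example.jpg", "a photo of a dog") > 0.3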

Maintenance & Community

The project accompanies the paper "DataComp: In search of the next generation of multimodal datasets", whose large author list spans many academic and industry groups, indicating significant research backing. The README does not mention dedicated community channels such as a forum or chat server.

Licensing & Compatibility

The repository itself does not explicitly state a license. However, it relies on and provides tooling for datasets and models, some of which may have their own licenses. Compatibility for commercial use or closed-source linking would require checking the licenses of specific datasets and model checkpoints used.

Limitations & Caveats

The xlarge scale dataset requires substantial storage (450 TB) and compute resources. Some baseline methods, particularly image-based filtering, require GPU resources. The README notes potential non-determinism in training runs due to factors like random network failures.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 30 days

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

Explore Similar Projects

METER by zdou0830

Top 0% on SourcePulse
373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Tobi Lutke (Cofounder of Shopify), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 9 more.

lilac by databricks

Top 0.1% on SourcePulse
1k stars
Data exploration tool for LLM dataset curation and quality control
Created 2 years ago
Updated 1 year ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

Curator by NVIDIA-NeMo

Top 1.3% on SourcePulse
1k stars
Data curation toolkit for LLMs
Created 1 year ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce

Top 0.2% on SourcePulse
11k stars
Library for language-vision AI research
Created 3 years ago
Updated 10 months ago