datacomp by mlfoundations

Competition tooling for multimodal dataset curation to pre-train CLIP models

created 2 years ago
728 stars

Top 48.5% on sourcepulse

Project Summary

DataComp is a competition and tooling repository for designing and evaluating multimodal datasets used to train CLIP models. It targets researchers and engineers who want to advance multimodal understanding through dataset curation rather than model architecture, providing a structured framework for discovering image-text datasets that improve downstream task performance.

How It Works

Participants curate image-text datasets and train fixed CLIP models on them. The core innovation lies in dataset design, with two tracks: "filtering" (using provided data pools) and "Bring Your Own Data" (BYOD). Both tracks offer multiple compute scales (small to xlarge) to accommodate diverse resources. Data is provided as webdataset shards, and participants select subsets using unique identifiers (UIDs) stored in NumPy arrays.
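To illustrate the UID-based subset selection, here is a minimal sketch of how a subset file might be built. It assumes each UID is a 128-bit hex string packed as two 64-bit integers into a structured NumPy array; the exact dtype and packing convention are assumptions, so consult the repository README for the authoritative format.

```python
import numpy as np

def uids_to_array(uids):
    """Pack 32-character hex UIDs into a structured array of two uint64s.

    Assumption: each UID is split into high/low 64-bit halves,
    stored with dtype "u8,u8". Verify against the repo's README.
    """
    return np.array(
        [(int(u[:16], 16), int(u[16:], 16)) for u in uids],
        dtype=np.dtype("u8,u8"),
    )

# Hypothetical UIDs for illustration only.
subset = uids_to_array([
    "00000000000000000000000000000001",
    "ffffffffffffffffffffffffffffffff",
])
np.save("subset.npy", subset)  # file handed to the training pipeline
```

The resulting `.npy` file identifies which examples from the provided pool make up a participant's curated dataset.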

Quick Start & Requirements

  • Install dependencies: bash create_env.sh then conda activate datacomp.
  • Additional dependencies for cloud storage: pip install 'cloudpathlib[s3]'.
  • Download data pools: python download_upstream.py --scale $scale --data_dir $data_dir.
  • Data scales range from 12.8M (small) to 12.8B (xlarge) examples.
  • Full dataset download for xlarge scale requires ~450 TB.
  • Official documentation and paper are linked in the README.
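Put together, a typical setup session might look like the following. The commands mirror the bullets above; the scale and path values are placeholders, and the actual download requires network access and substantial disk space.

```shell
# Create and activate the conda environment (script lives at the repo root)
bash create_env.sh
conda activate datacomp

# Optional: add S3 support for cloud-hosted data directories
pip install 'cloudpathlib[s3]'

# Download the small-scale pool (~12.8M examples) as webdataset shards
python download_upstream.py --scale small --data_dir ./data/small
```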

Highlighted Details

  • Supports multiple dataset scales from 12.8M to 12.8B image-text pairs.
  • Provides baseline filtering strategies including CLIP score, LAION-2B, and image-based clustering.
  • Training utilizes torchrun with fixed hyperparameters per scale (ViT-B/32, B/16, L/14).
  • Evaluation and submission are integrated via Hugging Face Hub.
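A training run on a curated subset might be launched roughly as follows. This is a sketch: the flag names and script name are assumptions based on the repository's conventions, and the fixed per-scale hyperparameters (model size, steps, batch size) are applied by the training script itself.

```shell
# Hypothetical invocation; check train.py --help for the real interface
torchrun --nproc_per_node 8 train.py \
    --scale small \
    --data_dir ./data/small/shards \
    --output_dir ./output \
    --exp_name my_filtering_baseline
```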

Maintenance & Community

The project accompanies the paper "DataComp: In search of the next generation of multimodal datasets," authored by a large academic collaboration, indicating significant research backing. No dedicated community channels are mentioned in the README.

Licensing & Compatibility

The repository itself does not explicitly state a license. However, it relies on and provides tooling for datasets and models, some of which may have their own licenses. Compatibility for commercial use or closed-source linking would require checking the licenses of specific datasets and model checkpoints used.

Limitations & Caveats

The xlarge scale dataset requires substantial storage (450 TB) and compute resources. Some baseline methods, particularly image-based filtering, require GPU resources. The README notes potential non-determinism in training runs due to factors like random network failures.

Health Check
Last commit

3 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
28 stars in the last 90 days
