datacomp by mlfoundations

Competition tooling for multimodal dataset curation to pre-train CLIP models

created 2 years ago
728 stars

Top 48.5% on sourcepulse

Project Summary

DataComp is a competition and tooling repository for designing and evaluating multimodal datasets used to train CLIP models. It targets researchers and engineers who want to advance multimodal understanding through dataset curation rather than model architecture, providing a structured framework for discovering image-text datasets that improve downstream task performance.

How It Works

Participants curate image-text datasets and train fixed CLIP models on them. The core innovation lies in dataset design, with two tracks: "filtering" (using provided data pools) and "Bring Your Own Data" (BYOD). Both tracks offer multiple compute scales (small to xlarge) to accommodate diverse resources. Data is provided as webdataset shards, and participants select subsets using unique identifiers (UIDs) stored in NumPy arrays.
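To illustrate the UID-based subset selection, here is a minimal sketch of how a subset file might be built. It assumes each UID is a 128-bit hex string packed as two 64-bit integers into a structured NumPy array; the exact dtype and packing convention are assumptions, so consult the repository README for the authoritative format.

```python
import numpy as np

def uids_to_array(uids):
    """Pack 32-character hex UIDs into a structured array of two uint64s.

    Assumption: each UID is split into high/low 64-bit halves,
    stored with dtype "u8,u8". Verify against the repo's README.
    """
    return np.array(
        [(int(u[:16], 16), int(u[16:], 16)) for u in uids],
        dtype=np.dtype("u8,u8"),
    )

# Hypothetical UIDs for illustration only.
subset = uids_to_array([
    "00000000000000000000000000000001",
    "ffffffffffffffffffffffffffffffff",
])
np.save("subset.npy", subset)  # file handed to the training pipeline
```

The resulting `.npy` file identifies which examples from the provided pool make up a participant's curated dataset.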

Quick Start & Requirements

  • Install dependencies: bash create_env.sh then conda activate datacomp.
  • Additional dependencies for cloud storage: pip install 'cloudpathlib[s3]'.
  • Download data pools: python download_upstream.py --scale $scale --data_dir $data_dir.
  • Data scales range from 12.8M (small) to 12.8B (xlarge) examples.
  • Full dataset download for xlarge scale requires ~450 TB.
  • Official documentation and paper are linked in the README.
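Put together, a typical setup session might look like the following. The commands mirror the bullets above; the scale and path values are placeholders, and the actual download requires network access and substantial disk space.

```shell
# Create and activate the conda environment (script lives at the repo root)
bash create_env.sh
conda activate datacomp

# Optional: add S3 support for cloud-hosted data directories
pip install 'cloudpathlib[s3]'

# Download the small-scale pool (~12.8M examples) as webdataset shards
python download_upstream.py --scale small --data_dir ./data/small
```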

Highlighted Details

  • Supports multiple dataset scales from 12.8M to 12.8B image-text pairs.
  • Provides baseline filtering strategies including CLIP score, LAION-2B, and image-based clustering.
  • Training utilizes torchrun with fixed hyperparameters per scale (ViT-B/32, B/16, L/14).
  • Evaluation and submission are integrated via Hugging Face Hub.
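A training run on a curated subset might be launched roughly as follows. This is a sketch: the flag names and script name are assumptions based on the repository's conventions, and the fixed per-scale hyperparameters (model size, steps, batch size) are applied by the training script itself.

```shell
# Hypothetical invocation; check train.py --help for the real interface
torchrun --nproc_per_node 8 train.py \
    --scale small \
    --data_dir ./data/small/shards \
    --output_dir ./output \
    --exp_name my_filtering_baseline
```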

Maintenance & Community

The project accompanies the paper "DataComp: In search of the next generation of multimodal datasets," authored by a large academic collaboration, indicating significant research backing. No dedicated community channels are mentioned in the README.

Licensing & Compatibility

The repository itself does not explicitly state a license. However, it relies on and provides tooling for datasets and models, some of which may have their own licenses. Compatibility for commercial use or closed-source linking would require checking the licenses of specific datasets and model checkpoints used.

Limitations & Caveats

The xlarge scale dataset requires substantial storage (450 TB) and compute resources. Some baseline methods, particularly image-based filtering, require GPU resources. The README notes potential non-determinism in training runs due to factors like random network failures.

Health Check
Last commit

3 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
28 stars in the last 90 days
