Competition tooling for multimodal dataset curation to pre-train CLIP models
DataComp is a competition and tooling repository for designing and evaluating multimodal datasets used to train CLIP models. It targets researchers and engineers who want to advance multimodal understanding through dataset curation rather than model architecture. The benefit is a standardized benchmark: because the model, hyperparameters, and evaluation suite are held fixed, improvements on downstream tasks can be attributed directly to better data.
How It Works
Participants curate image-text datasets and train fixed CLIP models on them. The core innovation lies in dataset design, with two tracks: "filtering" (using provided data pools) and "Bring Your Own Data" (BYOD). Both tracks offer multiple compute scales (small to xlarge) to accommodate diverse resources. Data is provided as webdataset shards, and participants select subsets using unique identifiers (UIDs) stored in NumPy arrays.
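Subset selection concretely means writing out a NumPy array of the chosen UIDs. Below is a minimal sketch, assuming the convention of 128-bit hex UIDs stored as sorted pairs of unsigned 64-bit integers; the dtype, filename, and placeholder UIDs are illustrative and should be checked against the repository docs.

    import numpy as np

    # Hypothetical subset file: each entry is one selected sample's UID,
    # a 128-bit hex string split into two uint64 halves.
    uids = [
        "0123456789abcdef0123456789abcdef",  # placeholder UIDs
        "fedcba9876543210fedcba9876543210",
    ]
    subset = np.array(
        [(int(u[:16], 16), int(u[16:], 16)) for u in uids],
        dtype=np.dtype("u8,u8"),
    )
    subset.sort()                      # sorted order for fast membership checks
    np.save("my_subset.npy", subset)   # hand this file to the training tooling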
Quick Start & Requirements
Set up the environment: bash create_env.sh, then conda activate datacomp
Optional S3 support: pip install 'cloudpathlib[s3]'
Download a data pool: python download_upstream.py --scale $scale --data_dir $data_dir
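For example, to pull the smallest pool into a local directory (same flags as above; the directory name is arbitrary):

    python download_upstream.py --scale small --data_dir ./datacomp_small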
Highlighted Details
Training is launched with torchrun and uses fixed hyperparameters per compute scale (ViT-B/32, B/16, L/14).
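A representative multi-GPU launch might look like the following; the train.py entry point, flag names, and GPU count are assumptions to verify against the repository:

    torchrun --nproc_per_node 8 train.py \
        --scale small \
        --data_dir ./datacomp_small \
        --output_dir ./train_output \
        --exp_name my_filtering_baseline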
Maintenance & Community
The project accompanies the paper "DataComp: In search of the next generation of multimodal datasets", whose large author list spans many academic and industrial institutions, indicating substantial backing. No dedicated community channels are mentioned in the README.
Licensing & Compatibility
The repository itself does not explicitly state a license. However, it relies on and provides tooling for datasets and models, some of which may have their own licenses. Compatibility for commercial use or closed-source linking would require checking the licenses of specific datasets and model checkpoints used.
Limitations & Caveats
The xlarge scale dataset requires substantial storage (450 TB) and compute resources. Some baseline methods, particularly image-based filtering, require GPU resources. The README notes potential non-determinism in training runs due to factors like random network failures.