mega-data-factory by duoan

Scalable multimodal data pipeline for SOTA foundation models

Created 1 month ago
270 stars

Top 95.4% on SourcePulse

Project Summary

Mega Data Factory is a distributed, high-throughput data processing pipeline designed for web-scale multimodal datasets (hundreds of billions of items). It targets engineers and researchers building state-of-the-art foundation models, enabling reproducible, efficient processing, ablation studies, quality scoring, and deduplication across text, image, and vision-language data. The pipeline leverages Ray for distributed execution, with Rust-accelerated operators for CPU-bound tasks and GPU-optimized components for deep learning inference.

How It Works

The system employs a pipeline-parallel architecture orchestrated by Ray, utilizing ObjectRef chaining for concurrent stage execution and backpressure control to manage large datasets. Key operations like text extraction, image quality assessment, and perceptual hashing are accelerated using Rust, offering significant speedups. GPU-optimized operators handle computationally intensive tasks such as CLIP and SigLIP embedding generation. Data processing is configured via YAML files, allowing users to define complex workflows involving various loaders, refiners, filters, and deduplicators.
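To make the YAML-driven workflow concrete, here is a hypothetical sketch of what such a configuration might look like. The stage keys and parameters are illustrative assumptions, not the project's actual schema (only CommonCrawlLoader and ImagePhashDeduplicator are operator names taken from this summary); see the configs/ directory in the repository for real examples.

```yaml
# Hypothetical pipeline config sketch; consult configs/ in the repo for the real schema.
pipeline:
  stages:
    - loader: CommonCrawlLoader          # Rust-accelerated text extraction
      params:
        input_path: s3://example-bucket/cc-warc/   # illustrative path
    - filter: QualityFilter              # illustrative stage name
      params:
        min_score: 0.5
    - deduplicator: ImagePhashDeduplicator
      params:
        num_buckets: 64                  # bucketed dedup state for distributed workers
```

A config like this would define the loaders, refiners, filters, and deduplicators that the Ray-orchestrated stages execute concurrently.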

Quick Start & Requirements

Installation involves cloning the repository and installing dependencies with uv pip install -e . (the trailing dot is part of the command). A Rust toolchain (installable via rustup) is required to build the accelerated operators, and GPU resources are needed for specific embedding and scoring operators. Pipeline runs are started with the mdf run command and a configuration file (e.g., configs/z_image.yaml). Interactive reports for pipeline runs are available at https://huggingface.co/spaces/classtag/mega-data-factory-reports.
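Putting those steps together, a quick start might look like the following. The repository URL is an assumption inferred from the project and author names shown above; the other commands are taken from this summary.

```shell
# Assumed repo URL based on the project/author names; verify before use.
git clone https://github.com/duoan/mega-data-factory.git
cd mega-data-factory

# Rust toolchain, required to build the accelerated operators
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install the package and its dependencies
uv pip install -e .

# Launch a pipeline run from a YAML config
mdf run configs/z_image.yaml
```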

Highlighted Details

  • Rust Acceleration: Offers 10-25x speedups for operators like image quality assessment, perceptual hashing (ImagePhashDeduplicator), and text extraction (CommonCrawlLoader).
  • GPU Optimization: Efficiently generates CLIP and SigLIP embeddings, crucial for multimodal foundation models.
  • Scalability: Designed to handle datasets in the hundreds of billions, with features like bucketed deduplication for distributed state management.
  • Reproducibility: Aims to reproduce SOTA foundation model data pipelines, including support for datasets like RefinedWeb, FineWeb, and LAION-5B.
  • Performance: Achieves high throughput, with text processing reaching ~20k records/sec on 8 CPU cores and image embedding processing at ~132 records/sec on a GPU.
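The bucketed deduplication mentioned above partitions dedup state by hash bucket so no single worker must hold the full set of seen hashes. The project's implementation is Rust-accelerated and distributed; the following is only a minimal pure-Python sketch of the bucketing idea, with all function names hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 16  # hypothetical; a real pipeline shards these buckets across workers


def bucket_of(phash: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Route a perceptual hash to a bucket, so each worker owns one slice of state."""
    return int(hashlib.sha256(phash.encode()).hexdigest(), 16) % num_buckets


def dedup_bucketed(items):
    """Yield the first item seen for each hash; state is partitioned per bucket."""
    seen = defaultdict(set)  # bucket id -> hashes already emitted
    for item_id, phash in items:
        b = bucket_of(phash)
        if phash not in seen[b]:
            seen[b].add(phash)
            yield item_id, phash


items = [("a", "f00d"), ("b", "beef"), ("c", "f00d")]
print(list(dedup_bucketed(items)))  # the duplicate "c" is dropped
```

In a distributed setting, each bucket's `seen` set would live on the worker that owns that bucket, which is what keeps per-worker memory bounded at hundreds-of-billions scale.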

Maintenance & Community

The project is primarily authored by Duo An. No specific community channels (e.g., Discord, Slack) or external sponsorships are mentioned in the provided README.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and modification with attribution.

Limitations & Caveats

Several planned and in-progress pipelines are listed, indicating that the project is under active development and not all envisioned capabilities are fully implemented. The use of Rust acceleration requires a compatible toolchain, and GPU resources are essential for certain advanced operators.

Health Check

Last Commit: 21 hours ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 8
Star History: 280 stars in the last 30 days

Explore Similar Projects

towhee by towhee-io
0% · 3k stars
Framework for neural data processing pipelines
Created 4 years ago · Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

data-juicer by datajuicer
0.5% · 6k stars
Data-Juicer: Data processing system for foundation models
Created 2 years ago · Updated 1 day ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (coauthor of SWE-bench, SWE-agent), and 5 more.

datasets by huggingface
0.1% · 21k stars
Access and process large AI datasets efficiently
Created 6 years ago · Updated 1 day ago
Starred by Clement Delangue (cofounder of Hugging Face), Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), and 26 more.