mega-data-factory by duoan

Scalable multimodal data pipeline for SOTA foundation models

Created 1 month ago
270 stars

Top 95.4% on SourcePulse

Project Summary

Mega Data Factory is a distributed, high-throughput data processing pipeline designed for web-scale multimodal datasets (hundreds of billions of items). It targets engineers and researchers building state-of-the-art foundation models, enabling reproducible, efficient processing, ablation studies, quality scoring, and deduplication across text, image, and vision-language data. The pipeline leverages Ray for distributed execution, with Rust-accelerated operators for CPU-bound tasks and GPU-optimized components for deep learning inference.

How It Works

The system employs a pipeline-parallel architecture orchestrated by Ray, utilizing ObjectRef chaining for concurrent stage execution and backpressure control to manage large datasets. Key operations like text extraction, image quality assessment, and perceptual hashing are accelerated using Rust, offering significant speedups. GPU-optimized operators handle computationally intensive tasks such as CLIP and SigLIP embedding generation. Data processing is configured via YAML files, allowing users to define complex workflows involving various loaders, refiners, filters, and deduplicators.
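To make the YAML-driven workflow concrete, here is a hypothetical sketch of what such a configuration might look like. The stage keys and parameters are illustrative assumptions, not the project's actual schema (only CommonCrawlLoader and ImagePhashDeduplicator are operator names taken from this summary); see the configs/ directory in the repository for real examples.

```yaml
# Hypothetical pipeline config sketch; consult configs/ in the repo for the real schema.
pipeline:
  stages:
    - loader: CommonCrawlLoader          # Rust-accelerated text extraction
      params:
        input_path: s3://example-bucket/cc-warc/   # illustrative path
    - filter: QualityFilter              # illustrative stage name
      params:
        min_score: 0.5
    - deduplicator: ImagePhashDeduplicator
      params:
        num_buckets: 64                  # bucketed dedup state for distributed workers
```

A config like this would define the loaders, refiners, filters, and deduplicators that the Ray-orchestrated stages execute concurrently.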

Quick Start & Requirements

Installation involves cloning the repository and installing dependencies with uv pip install -e . (the trailing dot is part of the command). A Rust toolchain (installable via rustup) is required to build the accelerated operators, and GPU resources are needed for specific embedding and scoring operators. Pipeline runs are started with the mdf run command and a configuration file (e.g., configs/z_image.yaml). Interactive reports for pipeline runs are available at https://huggingface.co/spaces/classtag/mega-data-factory-reports.
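Putting those steps together, a quick start might look like the following. The repository URL is an assumption inferred from the project and author names shown above; the other commands are taken from this summary.

```shell
# Assumed repo URL based on the project/author names; verify before use.
git clone https://github.com/duoan/mega-data-factory.git
cd mega-data-factory

# Rust toolchain, required to build the accelerated operators
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install the package and its dependencies
uv pip install -e .

# Launch a pipeline run from a YAML config
mdf run configs/z_image.yaml
```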

Highlighted Details

  • Rust Acceleration: Offers 10-25x speedups for operators like image quality assessment, perceptual hashing (ImagePhashDeduplicator), and text extraction (CommonCrawlLoader).
  • GPU Optimization: Efficiently generates CLIP and SigLIP embeddings, crucial for multimodal foundation models.
  • Scalability: Designed to handle datasets in the hundreds of billions, with features like bucketed deduplication for distributed state management.
  • Reproducibility: Aims to reproduce SOTA foundation model data pipelines, including support for datasets like RefinedWeb, FineWeb, and LAION-5B.
  • Performance: Achieves high throughput, with text processing reaching ~20k records/sec on 8 CPU cores and image embedding processing at ~132 records/sec on a GPU.
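The bucketed deduplication mentioned above partitions dedup state by hash bucket so no single worker must hold the full set of seen hashes. The project's implementation is Rust-accelerated and distributed; the following is only a minimal pure-Python sketch of the bucketing idea, with all function names hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 16  # hypothetical; a real pipeline shards these buckets across workers


def bucket_of(phash: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Route a perceptual hash to a bucket, so each worker owns one slice of state."""
    return int(hashlib.sha256(phash.encode()).hexdigest(), 16) % num_buckets


def dedup_bucketed(items):
    """Yield the first item seen for each hash; state is partitioned per bucket."""
    seen = defaultdict(set)  # bucket id -> hashes already emitted
    for item_id, phash in items:
        b = bucket_of(phash)
        if phash not in seen[b]:
            seen[b].add(phash)
            yield item_id, phash


items = [("a", "f00d"), ("b", "beef"), ("c", "f00d")]
print(list(dedup_bucketed(items)))  # the duplicate "c" is dropped
```

In a distributed setting, each bucket's `seen` set would live on the worker that owns that bucket, which is what keeps per-worker memory bounded at hundreds-of-billions scale.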

Maintenance & Community

The project is primarily authored by Duo An. No specific community channels (e.g., Discord, Slack) or external sponsorships are mentioned in the provided README.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and modification with attribution.

Limitations & Caveats

Several planned and in-progress pipelines are listed, indicating that the project is under active development and not all envisioned capabilities are fully implemented. The use of Rust acceleration requires a compatible toolchain, and GPU resources are essential for certain advanced operators.

Health Check

Last Commit: 21 hours ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 8
Star History: 280 stars in the last 30 days

Explore Similar Projects

towhee by towhee-io
0% · 3k stars
Framework for neural data processing pipelines
Created 4 years ago · Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

data-juicer by datajuicer
0.5% · 6k stars
Data-Juicer: Data processing system for foundation models
Created 2 years ago · Updated 1 day ago
Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (coauthor of SWE-bench, SWE-agent), and 5 more.

datasets by huggingface
0.1% · 21k stars
Access and process large AI datasets efficiently
Created 6 years ago · Updated 1 day ago
Starred by Clement Delangue (cofounder of Hugging Face), Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), and 26 more.