duoan
Scalable multimodal data pipeline for SOTA foundation models
Top 95.4% on SourcePulse
Summary
Mega Data Factory is a distributed, high-throughput data processing pipeline designed for web-scale multimodal datasets (hundreds of billions of items). It targets engineers and researchers building state-of-the-art foundation models, enabling reproducible, efficient processing, ablation studies, quality scoring, and deduplication across text, image, and vision-language data. The pipeline leverages Ray for distributed execution, with Rust-accelerated operators for CPU-bound tasks and GPU-optimized components for deep learning inference.
How It Works
The system employs a pipeline-parallel architecture orchestrated by Ray, utilizing ObjectRef chaining for concurrent stage execution and backpressure control to manage large datasets. Key operations like text extraction, image quality assessment, and perceptual hashing are accelerated using Rust, offering significant speedups. GPU-optimized operators handle computationally intensive tasks such as CLIP and SigLIP embedding generation. Data processing is configured via YAML files, allowing users to define complex workflows involving various loaders, refiners, filters, and deduplicators.
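The stage-to-stage flow with backpressure described above can be illustrated with a small sketch. This is not the project's code and does not use Ray: it swaps in standard-library threads and bounded queues as a stand-in for ObjectRef chaining, so that it runs without extra dependencies. The names run_stage, run_pipeline, max_in_flight, refine, keep_long, and make_deduplicator are all hypothetical and chosen only for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def run_stage(fn, inbox, outbox):
    """Apply fn to each item from inbox; forward non-None results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        result = fn(item)
        if result is not None:
            # put() blocks while outbox is full, which slows this stage
            # down until the next stage catches up (backpressure).
            outbox.put(result)


def run_pipeline(items, stages, max_in_flight=4):
    """Run all stages concurrently, one thread per stage (hypothetical helper)."""
    # One bounded queue in front of each stage, plus one for final results.
    queues = [queue.Queue(maxsize=max_in_flight) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(
            target=run_stage, args=(fn, queues[i], queues[i + 1]), daemon=True
        )
        for i, fn in enumerate(stages)
    ]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks when the first stage is saturated
        queues[0].put(_DONE)

    threads.append(threading.Thread(target=feed, daemon=True))
    for t in threads:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical refiner, filter, and deduplicator stages.
def refine(record):
    return {**record, "text": record["text"].strip().lower()}


def keep_long(record):
    return record if len(record["text"]) >= 5 else None


def make_deduplicator():
    seen = set()

    def dedup(record):
        if record["text"] in seen:
            return None
        seen.add(record["text"])
        return record

    return dedup


captions = ["  Hello world ", "hi", "hello world", "Another caption"]
records = [{"id": i, "text": t} for i, t in enumerate(captions)]
cleaned = run_pipeline(records, [refine, keep_long, make_deduplicator()], max_in_flight=2)
print([r["id"] for r in cleaned])
```

Because every queue is bounded, a slow downstream stage eventually blocks the stages feeding it instead of letting intermediate results pile up in memory. A Ray-based pipeline would typically get a similar effect by capping the number of outstanding object references, but how Mega Data Factory does this is not described in the summary above.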
Quick Start & Requirements
Installation involves cloning the repository and installing it with uv pip install -e . (an editable install of the cloned repository). A Rust toolchain (installable via rustup) is required to build the accelerated operators. GPU resources are necessary for specific embedding and scoring operators. Pipeline runs are started with the mdf run command and a configuration file (e.g., configs/z_image.yaml). Interactive reports for pipeline runs are available at https://huggingface.co/spaces/classtag/mega-data-factory-reports.
Highlighted Details
ImagePhashDeduplicator), and text extraction (CommonCrawlLoader).
Maintenance & Community
The project is primarily authored by Duo An. No specific community channels (e.g., Discord, Slack) or external sponsorships are mentioned in the provided README.
Licensing & Compatibility
The project is released under the MIT License, which permits commercial use, modification, and redistribution provided the copyright and license notice are retained.
Limitations & Caveats
Several planned and in-progress pipelines are listed, indicating that the project is under active development and not all envisioned capabilities are fully implemented. The use of Rust acceleration requires a compatible toolchain, and GPU resources are essential for certain advanced operators.