data-juicer by datajuicer

Data-Juicer: Data processing system for foundation models

Created 2 years ago

5,921 stars

Top 8.5% on SourcePulse

7 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Alexander Wettig

Coauthor of SWE-bench, SWE-agent

Junyang Lin

Core Maintainer at Alibaba Qwen

Jeremy Howard

Cofounder of fast.ai

and 3 more!

Project Summary

Data-Juicer is a comprehensive system for processing text and multimodal data tailored for foundation models, particularly LLMs. It offers a systematic, reusable, and extensible library of over 100 operators and 50+ data recipes, enabling efficient data analysis, cleaning, and synthesis for various stages of model development, from pre-training to post-tuning.

How It Works

Data-Juicer uses a modular, operator-based architecture in which users chain data processing steps together. It supports distributed processing via Ray for cloud-scale workloads and includes a "Sandbox" environment for iterative data-model co-development and rapid experimentation. The project describes this approach as efficient and robust, and reports that it has been proven in large-scale production environments.
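To make the idea of chaining operators concrete, here is a minimal sketch in plain Python. It does not use Data-Juicer's own API: the names whitespace_mapper, make_length_filter, exact_deduplicator, and run_pipeline are hypothetical and exist only for illustration. The sketch assumes each sample is a dict with a "text" field and that a pipeline is an ordered list of mapping, filtering, and deduplicating steps.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]


def whitespace_mapper(sample: Sample) -> Sample:
    """Hypothetical mapper: collapse runs of whitespace in the text field."""
    return {**sample, "text": " ".join(sample["text"].split())}


def make_length_filter(min_len: int) -> Callable[[Sample], bool]:
    """Hypothetical filter factory: keep samples with at least min_len characters."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"]) >= min_len
    return keep


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    """Hypothetical deduplicator: drop samples whose text was already seen."""
    seen = set()
    unique = []
    for sample in samples:
        if sample["text"] not in seen:
            seen.add(sample["text"])
            unique.append(sample)
    return unique


def run_pipeline(samples: List[Sample], steps: List[Tuple[str, Callable]]) -> List[Sample]:
    """Apply each (kind, operator) step in order to the whole dataset."""
    for kind, op in steps:
        if kind == "map":
            samples = [op(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if op(s)]
        elif kind == "dedup":
            samples = op(samples)
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return samples


raw = [{"text": "  hello   world  "}, {"text": "hello world"}, {"text": "hi"}]
steps = [
    ("map", whitespace_mapper),
    ("filter", make_length_filter(5)),
    ("dedup", exact_deduplicator),
]
cleaned = run_pipeline(raw, steps)
```

Because every step shares the same sample shape, steps can be reordered, removed, or reused across datasets, which is the property a recipe-style configuration relies on.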

Quick Start & Requirements

Install: pip install py-data-juicer (basic functions only) or install from source for full features.
Prerequisites: Python 3.9 or 3.10; gcc >= 5 (with C++14 support). For video operators, FFmpeg must be installed and available on the PATH.
Setup: Installing from source with optional dependencies can be time-consuming. Refer to the DJ-Cookbook for detailed guides and interactive examples.
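Before installing, it can help to confirm that the environment meets the prerequisites listed above. The following is a rough sketch using only the Python standard library; the function names python_version_supported and ffmpeg_available are hypothetical, and the version bounds are taken from this page rather than from the installer itself. It does not check the gcc requirement.

```python
import shutil
import sys
from typing import Optional, Tuple


def python_version_supported(version: Optional[Tuple[int, int]] = None) -> bool:
    """Return True if the (major, minor) version is 3.9 or 3.10.

    Defaults to the running interpreter when no version is given.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    return (3, 9) <= tuple(version[:2]) <= (3, 10)


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


def describe_environment() -> list:
    """Collect human-readable findings without raising."""
    findings = []
    if not python_version_supported():
        findings.append("Python version is outside the 3.9 to 3.10 range")
    if not ffmpeg_available():
        findings.append("ffmpeg not found on PATH (needed only for video operators)")
    return findings
```

Calling describe_environment() returns an empty list when both checks pass; otherwise each entry names one unmet prerequisite.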

Highlighted Details

Supports text, image, audio, and video data processing.
Features a "Sandbox" for data-model co-development and rapid iteration.
Offers distributed data processing capabilities using Ray for cloud-scale operations.
Includes over 100 operators and 50+ reusable data recipes.

Maintenance & Community

The project is actively maintained by the Data-Juicer Team, with frequent updates and new features. The team welcomes community contributions via issues, PRs, and a Slack channel. Notable integrations include Alibaba Cloud's Platform for AI (PAI).

Licensing & Compatibility

Released under the Apache License 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation provides only basic APIs; full functionality requires installation from source. Some operators may have significant third-party dependencies. Preprocessing complex raw data formats (e.g., nested archives, PDFs) might require custom tools.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

158 stars in the last 30 days