data-juicer by modelscope

Data-Juicer: Data processing system for foundation models

Created 2 years ago
5,200 stars

Top 9.6% on SourcePulse

Project Summary

Data-Juicer is a comprehensive system for processing text and multimodal data tailored for foundation models, particularly LLMs. It offers a systematic, reusable, and extensible library of over 100 operators and 50+ data recipes, enabling efficient data analysis, cleaning, and synthesis for various stages of model development, from pre-training to post-tuning.

How It Works

Data-Juicer uses a modular, operator-based architecture in which users chain operators together into a data processing pipeline. It supports distributed processing via Ray for cloud-scale workloads and includes a "Sandbox" environment for iterative data-model co-development and rapid experimentation. The project describes this approach as efficient and robust, and reports that it has been used in large-scale production environments.
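The operator chaining described above can be illustrated with a minimal sketch in plain Python. This is not Data-Juicer's API: the names mapper, keep_if, and run_chain are hypothetical and exist only to show how small, reusable steps compose into a pipeline.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical types for illustration only; not part of Data-Juicer.
Sample = Dict[str, str]
Operator = Callable[[Iterable[Sample]], Iterator[Sample]]


def mapper(fn: Callable[[Sample], Sample]) -> Operator:
    """Build an operator that transforms every sample with fn."""
    def op(samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            yield fn(sample)
    return op


def keep_if(pred: Callable[[Sample], bool]) -> Operator:
    """Build an operator that keeps only samples for which pred is true."""
    def op(samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            if pred(sample):
                yield sample
    return op


def run_chain(samples: Iterable[Sample], operators: List[Operator]) -> List[Sample]:
    """Feed the samples through each operator in order."""
    stream: Iterable[Sample] = iter(samples)
    for op in operators:
        stream = op(stream)
    return list(stream)


# A two-step chain: strip whitespace, then drop very short texts.
chain = [
    mapper(lambda s: {**s, "text": s["text"].strip()}),
    keep_if(lambda s: len(s["text"]) >= 5),
]
data = [{"text": "  hello world  "}, {"text": " hi "}]
result = run_chain(data, chain)
print(result)
```

Because each operator only consumes and produces a stream of samples, steps can be reordered or reused across pipelines without changing the others.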

Quick Start & Requirements

  • Install: pip install py-data-juicer (basic functions only) or install from source for full features.
  • Prerequisites: Python >= 3.9, <= 3.10; gcc >= 5 (C++14 support). For video operators, FFmpeg must be installed and in the PATH.
  • Setup: Installing from source with optional dependencies can be time-consuming. Refer to the DJ-Cookbook for detailed guides and interactive examples.
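
The prerequisites above can be checked before installing with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the executables are assumed to be named ffmpeg and gcc, and the gcc check only confirms that a compiler is on the PATH, not that its version is 5 or later.

```python
import shutil
import sys
from typing import Tuple


def python_version_ok(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """True if the (major, minor) version is within the stated 3.9 to 3.10 range."""
    return (3, 9) <= tuple(version) <= (3, 10)


def tool_on_path(name: str) -> bool:
    """True if an executable with this name can be found on the PATH."""
    return shutil.which(name) is not None


def report() -> None:
    """Print a simple pass/fail line for each prerequisite."""
    checks = {
        "Python >= 3.9, <= 3.10": python_version_ok(),
        "gcc on PATH (version not checked)": tool_on_path("gcc"),
        "ffmpeg on PATH (needed for video operators)": tool_on_path("ffmpeg"),
    }
    for label, ok in checks.items():
        print(f"{'OK     ' if ok else 'MISSING'} {label}")


report()
```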

Highlighted Details

  • Supports text, image, audio, and video data processing.
  • Features a "Sandbox" for data-model co-development and rapid iteration.
  • Offers distributed data processing capabilities using Ray for cloud-scale operations.
  • Includes over 100 operators and 50+ reusable data recipes.

Maintenance & Community

The project is actively maintained by the Data-Juicer Team, with frequent updates and new features. The team welcomes community contributions via issues, PRs, and a Slack channel. Notable integrations include Alibaba Cloud's Platform for AI (PAI).

Licensing & Compatibility

Released under the Apache License 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation provides only basic APIs; full functionality requires installation from source. Some operators may have significant third-party dependencies. Preprocessing complex raw data formats (e.g., nested archives, PDFs) might require custom tools.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 22
  • Issues (30d): 15
  • Star history: 197 stars in the last 30 days
