data-juicer by modelscope

Data-Juicer: Data processing system for foundation models

Created 2 years ago
5,453 stars

Top 9.2% on SourcePulse

Project Summary

Data-Juicer is a comprehensive system for processing text and multimodal data tailored for foundation models, particularly LLMs. It offers a systematic, reusable, and extensible library of over 100 operators and 50+ data recipes, enabling efficient data analysis, cleaning, and synthesis for various stages of model development, from pre-training to post-tuning.

How It Works

Data-Juicer uses a modular, operator-based architecture in which users chain data processing steps together. It supports distributed processing via Ray for cloud-scale workloads and includes a "Sandbox" environment for iterative data-model co-development and rapid experimentation. The result is efficient, robust data optimization that has been validated in large-scale production environments.
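
To make the operator-chaining idea concrete, here is a minimal plain-Python sketch. It does not use Data-Juicer's actual API: the Mapper and Filter classes, the run_pipeline helper, and the sample records are all hypothetical names chosen for illustration. It only shows the general pattern of passing samples through an ordered list of small, reusable operators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, str]  # one record, e.g. {"text": "..."}


@dataclass
class Mapper:
    """Hypothetical operator that rewrites every sample."""
    fn: Callable[[Sample], Sample]

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [self.fn(s) for s in samples]


@dataclass
class Filter:
    """Hypothetical operator that keeps only samples passing a predicate."""
    keep: Callable[[Sample], bool]

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if self.keep(s)]


def run_pipeline(samples: List[Sample], operators: list) -> List[Sample]:
    """Apply each operator in order, feeding its output to the next one."""
    for op in operators:
        samples = op.run(samples)
    return samples


# A two-step chain: normalize whitespace, then drop very short texts.
pipeline = [
    Mapper(lambda s: {**s, "text": " ".join(s["text"].split())}),
    Filter(lambda s: len(s["text"]) >= 10),
]

raw = [
    {"text": "  Hello   world,  this is   a sample. "},
    {"text": "too short"},
]
cleaned = run_pipeline(raw, pipeline)
print(cleaned)
```

Because each operator exposes the same run method, steps can be reordered, removed, or reused in other chains without touching the rest of the pipeline, which is the property a reusable operator library depends on.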

Quick Start & Requirements

  • Install: pip install py-data-juicer (basic functions only) or install from source for full features.
  • Prerequisites: Python >= 3.9, <= 3.10; gcc >= 5 (C++14 support). For video operators, FFmpeg must be installed and in the PATH.
  • Setup: Installing from source with optional dependencies can be time-consuming. Refer to the DJ-Cookbook for detailed guides and interactive examples.
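
The prerequisites above can be checked before installing with a short script. This is a sketch rather than an official tool: the check_prerequisites function is a hypothetical helper, it only tests whether gcc and ffmpeg are discoverable on the PATH, and it does not verify that the gcc version is 5 or newer.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    The version_info and which parameters exist only so the checks can be
    exercised with fake values; by default the real interpreter and PATH are used.
    """
    problems = []

    # Python must be between 3.9 and 3.10 inclusive, per the listed prerequisites.
    major_minor = tuple(version_info[:2])
    if not ((3, 9) <= major_minor <= (3, 10)):
        problems.append(
            f"Python {major_minor[0]}.{major_minor[1]} is outside the supported range 3.9 to 3.10"
        )

    # A C++14-capable compiler is required; only presence on PATH is checked here.
    if which("gcc") is None:
        problems.append("gcc not found on PATH")

    # FFmpeg is needed only for video operators.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (required for video operators)")

    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```

If the script prints nothing, the listed prerequisites appear to be present; installation itself then proceeds with pip install py-data-juicer or from source, as described above.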

Highlighted Details

  • Supports text, image, audio, and video data processing.
  • Features a "Sandbox" for data-model co-development and rapid iteration.
  • Offers distributed data processing capabilities using Ray for cloud-scale operations.
  • Includes over 100 operators and 50+ reusable data recipes.

Maintenance & Community

The project is actively maintained by the Data-Juicer Team, with frequent updates and new features. They welcome community contributions via issues, PRs, and a Slack channel. Notable integrations include Alibaba Cloud's Platform for AI (PAI).

Licensing & Compatibility

Released under the Apache License 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation provides only basic APIs; full functionality requires installation from source. Some operators may have significant third-party dependencies. Preprocessing complex raw data formats (e.g., nested archives, PDFs) might require custom tools.

Health Check

  • Last commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 9
  • Issues (30d): 4
  • Star history: 191 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 12 more.

datatrove by huggingface

Data processing library for large-scale text data

  • Top 0.5%, 3k stars
  • Created 2 years ago
  • Updated 3 weeks ago

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 26 more.

datasets by huggingface

Access and process large AI datasets efficiently

  • Top 0.1%, 21k stars
  • Created 5 years ago
  • Updated 1 day ago