data-juicer  by modelscope

Data-Juicer: Data processing system for foundation models

created 2 years ago
4,904 stars

Top 10.4% on sourcepulse

Project Summary

Data-Juicer is a comprehensive system for processing text and multimodal data tailored for foundation models, particularly LLMs. It offers a systematic, reusable, and extensible library of over 100 operators and 50+ data recipes, enabling efficient data analysis, cleaning, and synthesis for various stages of model development, from pre-training to post-tuning.

How It Works

Data-Juicer uses a modular, operator-based architecture in which users chain data processing steps together. It supports distributed processing via Ray for cloud-scale workloads and includes a "Sandbox" environment for iterative data-model co-development and rapid experimentation. The project describes this approach as efficient, robust, and proven in large-scale production environments.
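The operator-chaining idea can be illustrated with a minimal, library-free sketch. It does not use Data-Juicer's actual API: the names `whitespace_mapper`, `make_length_filter`, and `run_pipeline`, and the "map"/"filter" step labels, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]                 # one record, e.g. {"text": "..."}
Step = Tuple[str, Callable]             # ("map", fn) or ("filter", fn)


def whitespace_mapper(sample: Sample) -> Sample:
    """Hypothetical mapper: returns a copy with runs of whitespace collapsed."""
    return {**sample, "text": " ".join(sample["text"].split())}


def make_length_filter(min_len: int, max_len: int) -> Callable[[Sample], bool]:
    """Hypothetical filter factory: keeps samples whose text length is in range."""
    def keep(sample: Sample) -> bool:
        return min_len <= len(sample["text"]) <= max_len
    return keep


def run_pipeline(samples: List[Sample], steps: List[Step]) -> List[Sample]:
    """Apply each step in order; mappers transform samples, filters drop them."""
    for kind, fn in steps:
        if kind == "map":
            samples = [fn(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if fn(s)]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return samples
```

Under these assumptions, a chain is just an ordered list of steps, for example `run_pipeline(raw, [("map", whitespace_mapper), ("filter", make_length_filter(5, 20))])`. Reordering or swapping steps changes the pipeline without touching the operators themselves.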

Quick Start & Requirements

  • Install: pip install py-data-juicer (basic functions only) or install from source for full features.
  • Prerequisites: Python >= 3.9 and <= 3.10; gcc >= 5 (with C++14 support). Video operators also require FFmpeg to be installed and on the PATH.
  • Setup: Installing from source with optional dependencies can be time-consuming. Refer to the DJ-Cookbook for detailed guides and interactive examples.
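Before installing, the prerequisites listed above can be checked with a short standard-library script. This is a sketch only: the helper names `python_version_supported` and `ffmpeg_on_path` are hypothetical, and the version check assumes the stated bounds apply to the major.minor version.

```python
import shutil
import sys


def python_version_supported(version_info=sys.version_info) -> bool:
    """True if major.minor falls in the stated range (>= 3.9, <= 3.10)."""
    return (3, 9) <= tuple(version_info[:2]) <= (3, 10)


def ffmpeg_on_path() -> bool:
    """True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    print("FFmpeg on PATH (needed for video operators):", ffmpeg_on_path())
```

The gcc requirement is not checked here; a similar `shutil.which("gcc")` lookup would confirm only that a compiler is present, not its version.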

Highlighted Details

  • Supports text, image, audio, and video data processing.
  • Features a "Sandbox" for data-model co-development and rapid iteration.
  • Offers distributed data processing capabilities using Ray for cloud-scale operations.
  • Includes over 100 operators and 50+ reusable data recipes.

Maintenance & Community

The project is actively maintained by the Data-Juicer Team, with frequent updates and new features. The team welcomes community contributions via issues, PRs, and a Slack channel. Notable integrations include Alibaba Cloud's Platform for AI (PAI).

Licensing & Compatibility

Released under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The pip installation provides only basic APIs; full functionality requires installation from source. Some operators may have significant third-party dependencies. Preprocessing complex raw data formats (e.g., nested archives, PDFs) might require custom tools.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 21
  • Issues (30d): 11

Star History

605 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

towhee by towhee-io

0.2%
3k
Framework for neural data processing pipelines
created 4 years ago
updated 9 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago