data-prep-kit by data-prep-kit

Data preparation toolkit for GenAI applications

created 1 year ago
754 stars

Top 47.1% on sourcepulse

Project Summary

Data Prep Kit is an open-source toolkit designed to streamline the preparation of unstructured data for Generative AI applications, including LLM pre-training, fine-tuning, and RAG systems. It targets LLM developers and researchers, offering a scalable solution that can adapt from laptop to data center environments.

How It Works

The kit provides a modular framework with a growing set of transforms for data cleansing, transformation, and enrichment. It supports Natural Language and Code data modalities, leveraging Python, Ray, and Spark runtimes for scalable processing. Workflows are automated using Kubeflow Pipelines, enabling the creation of complex data preparation pipelines.
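To make the modular idea concrete, here is a minimal sketch in plain Python. It is not the Data Prep Kit API: the `Record` and `Transform` aliases, the `run_pipeline` helper, and both example transforms are hypothetical names invented for illustration, and the sketch leaves out the Ray, Spark, and Kubeflow Pipelines layers entirely.

```python
from typing import Callable, Iterable

# Hypothetical types for illustration only; not part of data-prep-kit.
Record = dict
Transform = Callable[[list[Record]], list[Record]]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Cleansing step: collapse runs of whitespace in each record's text."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def add_word_count(records: list[Record]) -> list[Record]:
    """Enrichment step: annotate each record with a word count."""
    return [{**r, "word_count": len(r["text"].split())} for r in records]


def run_pipeline(records: list[Record], transforms: Iterable[Transform]) -> list[Record]:
    """Apply each transform in order, feeding its output to the next."""
    for transform in transforms:
        records = transform(records)
    return records


docs = [{"text": "  Hello   world \n"}, {"text": "data\tprep kit"}]
result = run_pipeline(docs, [normalize_whitespace, add_word_count])
```

Because every step shares the same list-in, list-out shape, steps can be reordered or swapped without touching the others, which is the property a transform framework of this kind relies on.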

Quick Start & Requirements

  • Installation: pip install 'data-prep-toolkit-transforms[all]'
  • Prerequisites: Python 3.10, 3.11, or 3.12.
  • Getting Started: A Google Colab notebook is available for immediate testing: examples/notebooks/Run_your_first_transform_colab.ipynb.
  • Documentation: https://data-prep-kit.github.io/data-prep-kit/
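Before installing, it can help to confirm that the interpreter matches the listed prerequisites. The following is a small sketch, assuming only the version list stated above; the `SUPPORTED` set and the `is_supported` helper are hypothetical names, not something shipped with the kit.

```python
import sys

# Supported (major, minor) versions, as listed in the prerequisites above.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) is one of the listed Python versions."""
    return tuple(version[:2]) in SUPPORTED


if not is_supported(sys.version_info[:2]):
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} is outside the listed range 3.10 to 3.12")
```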

Highlighted Details

  • Supports multiple runtimes: Python-only, Ray, Spark, and KFP (Kubeflow Pipelines) on Ray.
  • Offers a wide array of transforms for data ingestion, universal processing (deduplication, profiling, PII redaction), language-specific tasks (chunking, language ID), and code-specific tasks (programming language annotation, license selection).
  • Includes a framework for developing and contributing custom transforms.
  • Recognized by LF AI & Data and adheres to OpenSSF Best Practices.
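As an illustration of one of the universal processing tasks listed above, the sketch below shows exact deduplication by content hash. It is a generic technique written in plain Python, not the kit's own deduplication transform; the `exact_dedup` function is a hypothetical name, and it only catches byte-identical documents, not near-duplicates.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing SHA-256 digests."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


sample = ["alpha", "beta", "alpha", "gamma", "beta"]
deduped = exact_dedup(sample)
```

Storing digests rather than full texts keeps the memory used by the `seen` set bounded per document, which matters more as document size grows.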

Maintenance & Community

The project is hosted by the LF AI & Data Foundation and originated from IBM Research. Community engagement is encouraged via the discussion section.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Although the kit supports multiple runtimes, not every transform is available on each of Python, Ray, and Spark. The project is under active development, with new modalities and runtime support still being added.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 25
  • Issues (30d): 24
  • Star History: 124 stars in the last 90 days
