data-prep-kit by data-prep-kit

Data preparation toolkit for GenAI applications

Created 1 year ago

888 stars

Top 40.6% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Data Prep Kit is an open-source toolkit designed to streamline the preparation of unstructured data for Generative AI applications, including LLM pre-training, fine-tuning, and RAG systems. It targets LLM developers and researchers, offering a scalable solution that can adapt from laptop to data center environments.

How It Works

The kit provides a modular framework with a growing set of transforms for data cleansing, transformation, and enrichment. It supports Natural Language and Code data modalities, leveraging Python, Ray, and Spark runtimes for scalable processing. Workflows are automated using Kubeflow Pipelines, enabling the creation of complex data preparation pipelines.
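As a rough illustration of the modular idea described above, the sketch below chains small, independent transforms (cleansing, filtering, enrichment) over a batch of documents. It uses plain Python only. The names `run_pipeline`, `strip_whitespace`, `drop_empty`, and `add_length`, and the `text` field, are hypothetical and are not part of the Data Prep Kit API.

```python
from typing import Callable, Iterable

# A document is modeled as a plain dict here; the toolkit's real data model may differ.
Doc = dict
Transform = Callable[[list[Doc]], list[Doc]]


def strip_whitespace(docs: list[Doc]) -> list[Doc]:
    """Cleansing step: trim leading and trailing whitespace from each text."""
    return [{**d, "text": d["text"].strip()} for d in docs]


def drop_empty(docs: list[Doc]) -> list[Doc]:
    """Filtering step: discard documents whose text is empty."""
    return [d for d in docs if d["text"]]


def add_length(docs: list[Doc]) -> list[Doc]:
    """Enrichment step: annotate each document with its character count."""
    return [{**d, "n_chars": len(d["text"])} for d in docs]


def run_pipeline(docs: list[Doc], transforms: Iterable[Transform]) -> list[Doc]:
    """Apply each transform in order, feeding the output of one into the next."""
    for transform in transforms:
        docs = transform(docs)
    return docs


if __name__ == "__main__":
    batch = [{"text": "  hello "}, {"text": "   "}]
    print(run_pipeline(batch, [strip_whitespace, drop_empty, add_length]))
```

Keeping every step as a function from a batch to a batch is what makes the steps reorderable and individually testable; the same shape could in principle be mapped onto a distributed runtime, though how the kit does this is not shown here.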

Quick Start & Requirements

Installation: pip install 'data-prep-toolkit-transforms[all]'
Prerequisites: Python 3.10, 3.11, or 3.12.
Getting Started: A Google Colab notebook is available for immediate testing: examples/notebooks/Run_your_first_transform_colab.ipynb.
Documentation: https://data-prep-kit.github.io/data-prep-kit/
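Because the prerequisites list only Python 3.10, 3.11, and 3.12, a quick interpreter check before installing can save a failed install. The helper below is a hypothetical convenience written for this summary; it is not shipped with the toolkit.

```python
import sys

# Supported (major, minor) pairs, taken from the prerequisites listed above.
SUPPORTED_VERSIONS = {(3, 10), (3, 11), (3, 12)}


def is_supported(version=None):
    """Return True if the given (or current) interpreter version is supported."""
    if version is None:
        version = sys.version_info
    return (version[0], version[1]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    status = "supported" if is_supported() else "not supported"
    print(f"Python {current} is {status}")
```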

Highlighted Details

Supports multiple runtimes: pure Python, Ray, Spark, and Kubeflow Pipelines (KFP) on Ray.
Offers a wide array of transforms for data ingestion, universal processing (deduplication, profiling, PII redaction), language-specific tasks (chunking, language ID), and code-specific tasks (programming language annotation, license selection).
Includes a framework for developing and contributing custom transforms.
Recognized by LF AI & Data and adheres to OpenSSF Best Practices.
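To make one of the universal processing tasks concrete, the sketch below shows exact deduplication by content hash. It illustrates the general technique only and is not the toolkit's implementation; `exact_dedup` and the `text` field are assumed names.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each distinct text, preserving input order."""
    seen = set()
    unique = []
    for doc in docs:
        # Hash the text so the set stores fixed-size digests rather than full documents.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

This only removes byte-identical texts; near-duplicate detection (for example, documents differing by whitespace or boilerplate) needs a fuzzy technique and is outside the scope of this sketch.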

Maintenance & Community

The project is hosted by the LF AI & Data Foundation and originated from IBM Research. Community engagement is encouraged via the discussion section.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Although the kit supports multiple runtimes, not every transform is available on each of Python, Ray, and Spark. The project is under active development, with new modalities and runtime support still being added.

Health Check

Last Commit

2 days ago

Responsiveness

1 week

Star History

27 stars in the last 30 days