data-prep-kit  by data-prep-kit

Data preparation toolkit for GenAI applications

Created 1 year ago
800 stars

Top 44.1% on SourcePulse

Project Summary

Data Prep Kit is an open-source toolkit for preparing unstructured data for Generative AI applications, including LLM pre-training, fine-tuning, and RAG systems. It targets LLM developers and researchers and is designed to scale from a laptop to a data center.

How It Works

The kit provides a modular framework with a growing set of transforms for data cleansing, transformation, and enrichment. It supports two data modalities, natural language and code, and runs on Python, Ray, and Spark runtimes for scalable processing. Workflows can be automated with Kubeflow Pipelines (KFP) to build multi-step data preparation pipelines.
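As a rough illustration of the modular idea, the sketch below chains small transform functions over an in-memory list of records. It is plain Python and does not use the Data Prep Kit API; the names `Record`, `strip_whitespace`, `drop_empty`, `add_length`, and `run_pipeline` are hypothetical and exist only for this example.

```python
from typing import Any, Callable, Iterable

# Hypothetical types for this sketch (not part of Data Prep Kit).
Record = dict[str, Any]
Transform = Callable[[Iterable[Record]], Iterable[Record]]


def strip_whitespace(records: Iterable[Record]) -> Iterable[Record]:
    """Cleansing step: trim leading and trailing whitespace from each text."""
    for rec in records:
        yield {**rec, "text": rec["text"].strip()}


def drop_empty(records: Iterable[Record]) -> Iterable[Record]:
    """Filtering step: discard records whose text is empty."""
    for rec in records:
        if rec["text"]:
            yield rec


def add_length(records: Iterable[Record]) -> Iterable[Record]:
    """Enrichment step: annotate each record with its character count."""
    for rec in records:
        yield {**rec, "length": len(rec["text"])}


def run_pipeline(records: Iterable[Record], transforms: list[Transform]) -> list[Record]:
    """Apply each transform in order and materialize the final result."""
    for transform in transforms:
        records = transform(records)
    return list(records)


docs = [{"id": "a", "text": "  hello "}, {"id": "b", "text": "   "}]
cleaned = run_pipeline(docs, [strip_whitespace, drop_empty, add_length])
print(cleaned)
```

Each step takes and returns an iterable of records, so steps can be reordered or swapped without touching the others. The same composition idea is what makes a set of independent transforms usable as a pipeline.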

Quick Start & Requirements

  • Installation: pip install 'data-prep-toolkit-transforms[all]'
  • Prerequisites: Python 3.10, 3.11, or 3.12.
  • Getting Started: A Google Colab notebook is available for immediate testing: examples/notebooks/Run_your_first_transform_colab.ipynb.
  • Documentation: https://data-prep-kit.github.io/data-prep-kit/
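Before installing, it can help to confirm that the interpreter falls within the listed Python versions. The check below is a minimal sketch using only the standard library; the helper name `is_supported` and the `SUPPORTED` set are assumptions made for this example, derived from the prerequisites above.

```python
import sys

# Minor versions listed in the prerequisites above.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) pair is one of the listed versions."""
    return (version_info[0], version_info[1]) in SUPPORTED


if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    status = "supported" if is_supported() else "not in the listed versions"
    print(f"Python {major}.{minor} is {status}")
```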

Highlighted Details

  • Supports multiple runtimes: Python-only, Ray, Spark, and KFP on Ray.
  • Offers a wide array of transforms for data ingestion, universal processing (deduplication, profiling, PII redaction), language-specific tasks (chunking, language ID), and code-specific tasks (programming language annotation, license selection).
  • Includes a framework for developing and contributing custom transforms.
  • Recognized by LF AI & Data and adheres to OpenSSF Best Practices.
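To make one of the universal transforms listed above concrete, here is a minimal sketch of exact deduplication by content hash. It illustrates the general technique only and is not the toolkit's implementation; the function name `exact_dedup` is hypothetical.

```python
import hashlib


def exact_dedup(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        # Hash the content so the seen-set stores fixed-size digests
        # instead of full documents.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(exact_dedup(["alpha", "beta", "alpha", "gamma", "beta"]))
```

This only catches byte-identical documents. Near-duplicates (for example, texts differing in whitespace or a few words) need a fuzzy approach, which this sketch does not attempt.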

Maintenance & Community

The project is hosted by the LF AI & Data Foundation and originated from IBM Research. Community engagement is encouraged via the discussion section.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Although the kit supports multiple runtimes, not every transform is available on all of them: availability varies across Python, Ray, and Spark. The project is under active development, with new modalities and runtime support still being added.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 13
  • Issues (30d): 4
  • Star History: 36 stars in the last 30 days
