Data preparation toolkit for GenAI applications
Data Prep Kit is an open-source toolkit designed to streamline the preparation of unstructured data for Generative AI applications, including LLM pre-training, fine-tuning, and RAG systems. It targets LLM developers and researchers and scales from a laptop to a data center.
How It Works
The kit provides a modular framework with a growing set of transforms for data cleansing, transformation, and enrichment. It supports Natural Language and Code data modalities, leveraging Python, Ray, and Spark runtimes for scalable processing. Workflows are automated using Kubeflow Pipelines, enabling the creation of complex data preparation pipelines.
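The idea of chaining small, independent transforms can be illustrated with a minimal sketch in plain Python. This is not Data Prep Kit's API: the `Document` layout, the three transform functions, and `run_pipeline` are all hypothetical names chosen for illustration, and the sketch runs in a single process rather than on Ray or Spark.

```python
from typing import Callable, Iterable, List, Optional

# A document is modelled as a plain dict; this layout is an assumption for illustration.
Document = dict

# A transform takes one document and returns a new one, or None to drop it.
Transform = Callable[[Document], Optional[Document]]


def strip_whitespace(doc: Document) -> Optional[Document]:
    """Cleansing: trim leading and trailing whitespace from the text."""
    return {**doc, "contents": doc["contents"].strip()}


def drop_empty(doc: Document) -> Optional[Document]:
    """Cleansing: discard documents that have no text left."""
    return doc if doc["contents"] else None


def add_word_count(doc: Document) -> Optional[Document]:
    """Enrichment: annotate the document with a whitespace-delimited word count."""
    return {**doc, "word_count": len(doc["contents"].split())}


def run_pipeline(docs: Iterable[Document], transforms: List[Transform]) -> List[Document]:
    """Apply each transform in order; a document is dropped when any transform returns None."""
    results: List[Document] = []
    for doc in docs:
        current: Optional[Document] = doc
        for transform in transforms:
            current = transform(current)
            if current is None:
                break  # later transforms never see a dropped document
        if current is not None:
            results.append(current)
    return results


docs = [
    {"id": 1, "contents": "  hello data prep  "},
    {"id": 2, "contents": "   "},
]
cleaned = run_pipeline(docs, [strip_whitespace, drop_empty, add_word_count])
print(cleaned)
```

Because each transform only sees one document at a time, the same functions could in principle be mapped over partitions by a distributed runtime; how the kit itself does this is not shown here.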
Quick Start & Requirements
Install: pip install 'data-prep-toolkit-transforms[all]'
Example notebook: examples/notebooks/Run_your_first_transform_colab.ipynb
Highlighted Details
Maintenance & Community
The project is hosted by the LF AI & Data Foundation and originated from IBM Research. Community engagement is encouraged via the discussion section.
Licensing & Compatibility
Limitations & Caveats
Although the kit supports multiple runtimes, not every transform is available on each of Python, Ray, and Spark. The project is under active development, with new modalities and runtime support being added.