dataverse by UpstageAI

ETL pipeline for LLM data processing

created 1 year ago
554 stars

Top 58.7% on sourcepulse

Project Summary

Dataverse is an open-source Python library that simplifies and standardizes ETL (Extract, Transform, Load) pipelines, aimed at data scientists and developers working with Large Language Models (LLMs). It takes a block-based, configuration-driven approach to data processing that hides most of the complexity of Apache Spark, which makes collaboration and scaling easier, especially on cloud platforms such as AWS EMR.

How It Works

Dataverse uses a block-based architecture in which each registered ETL function is a "block" that runs on Spark. Users build a pipeline by configuring a sequence of these blocks, much like assembling puzzle pieces. Because both the Spark setup and the ETL steps are defined through configuration options, a pipeline requires little hand-written code. The framework is also extensible: users can register their own custom functions as blocks.
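The registry-plus-configuration idea described above can be illustrated with a minimal sketch in plain Python. This is not the Dataverse API: the names `register_block`, `BLOCK_REGISTRY`, `run_pipeline`, and the three example blocks are hypothetical, and the blocks operate on in-memory lists rather than on Spark. It only shows how named blocks might be looked up and chained from a configuration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Block = Callable[..., List[Record]]

# Hypothetical registry: maps a block name to the function that implements it.
BLOCK_REGISTRY: Dict[str, Block] = {}


def register_block(name: str) -> Callable[[Block], Block]:
    """Decorator that registers a function as a named block (illustrative only)."""
    def decorator(func: Block) -> Block:
        BLOCK_REGISTRY[name] = func
        return func
    return decorator


@register_block("ingest_inline")
def ingest_inline(data: List[Record], rows: List[Record]) -> List[Record]:
    # Extract step: ignores upstream data and loads rows given in the config.
    return list(rows)


@register_block("clean_strip")
def clean_strip(data: List[Record], field: str = "text") -> List[Record]:
    # Transform step: trims surrounding whitespace from one field.
    return [{**row, field: row[field].strip()} for row in data]


@register_block("dedup_exact")
def dedup_exact(data: List[Record], field: str = "text") -> List[Record]:
    # Transform step: keeps the first record for each distinct field value.
    seen = set()
    unique = []
    for row in data:
        if row[field] not in seen:
            seen.add(row[field])
            unique.append(row)
    return unique


def run_pipeline(config: Dict[str, Any]) -> List[Record]:
    """Run the blocks named in config["etl"] in order, passing data along."""
    data: List[Record] = []
    for step in config["etl"]:
        block = BLOCK_REGISTRY[step["name"]]
        data = block(data, **step.get("args", {}))
    return data


# The pipeline itself is only configuration: a list of block names and options.
config = {
    "etl": [
        {"name": "ingest_inline",
         "args": {"rows": [{"text": " hello "}, {"text": "hello"}, {"text": "world"}]}},
        {"name": "clean_strip"},
        {"name": "dedup_exact"},
    ]
}

result = run_pipeline(config)
print(result)
```

In this sketch, adding a custom step means writing one decorated function and referencing its name in the configuration; the runner itself does not change.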

Quick Start & Requirements

Highlighted Details

  • Supports over 50 registered ETL functions for extraction, transformation (including bias, cleaning, deduplication, PII, quality, toxicity), and loading.
  • Integrates with AWS S3 for data storage and AWS EMR for distributed pipeline execution.
  • Offers specific modules for data ingestion from Hugging Face, random sampling, MinHash-based deduplication, and saving to Parquet.
  • Upstage uses the project to train models such as Solar Mini and for its 1T Token Club initiative.
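
For readers unfamiliar with the MinHash-based deduplication mentioned in the list above, the following is a minimal, single-machine sketch of the general technique, not Dataverse's implementation. All function names and parameters here (`shingles`, `minhash_signature`, `estimated_jaccard`, `dedup`, the 3-word shingle size, 64 hash functions, the 0.8 threshold) are assumptions chosen for illustration. It compares signatures pairwise for simplicity, whereas large-scale systems typically add locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Split lowercased text into overlapping n-word shingles."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + item.encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not too similar to any document kept so far."""
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
    "completely different sentence about data pipelines and spark",
]
kept = dedup(docs)
print(kept)
```

The second document differs from the first only in letter case, so after lowercasing it produces an identical signature and is dropped, while the unrelated third document is kept.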

Maintenance & Community

Licensing & Compatibility

  • Licensed under the Apache-2.0 license.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Some transformation modules like 'bias', 'decontamination', and 'toxicity' are marked as Work In Progress (WIP).
  • Python version support is limited to 3.10 and 3.11.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 8 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

bytewax by bytewax

0.3%
2k
Python framework for stateful stream processing
created 3 years ago
updated 4 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (Author of SWE-bench, SWE-agent), and 2 more.

data-juicer by modelscope

0.7%
5k
Data-Juicer: Data processing system for foundation models
created 2 years ago
updated 1 day ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

towhee by towhee-io

0.2%
3k
Framework for neural data processing pipelines
created 4 years ago
updated 9 months ago