dataverse by UpstageAI

ETL pipeline for LLM data processing

Created 2 years ago

565 stars

Top 56.9% on SourcePulse

3 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Wing Lian

Founder of Axolotl AI

Philipp Schmid

DevRel at Google DeepMind

Project Summary

Dataverse is an open-source Python library that simplifies and standardizes ETL (Extract, Transform, Load) pipelines for data scientists and developers working with Large Language Models (LLMs). It takes a block-based, configuration-driven approach to data processing that hides much of the complexity of Apache Spark and makes pipelines easier to share and scale, including on cloud platforms such as AWS EMR.

How It Works

Dataverse uses a block-based architecture in which each registered ETL function is a "block" that runs on Spark. Users build a pipeline by listing blocks in a configuration, in the order they should run, much like assembling puzzle pieces. Because the Spark setup and the ETL steps are both declared as configuration options, a pipeline needs little hand-written code. The framework is also extensible: users can register their own functions as additional blocks.
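The registry-plus-configuration idea described above can be sketched in plain Python. This is only an illustration of the pattern, not Dataverse's API: `REGISTRY`, `register_block`, `run_pipeline`, the block names, and the config layout are all hypothetical, and the blocks operate on ordinary lists of dicts rather than on Spark.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]

# Hypothetical registry: maps a block name to a plain Python function.
REGISTRY: Dict[str, Callable[..., Rows]] = {}


def register_block(name: str):
    """Decorator that records a function in the registry under `name`."""
    def decorator(fn: Callable[..., Rows]) -> Callable[..., Rows]:
        REGISTRY[name] = fn
        return fn
    return decorator


@register_block("ingest_inline")
def ingest_inline(rows: Rows, records: Rows) -> Rows:
    # Extract step: ignore the incoming rows and emit the configured records.
    return [dict(r) for r in records]


@register_block("strip_whitespace")
def strip_whitespace(rows: Rows, field: str = "text") -> Rows:
    # Transform step: trim leading and trailing whitespace in one field.
    return [{**r, field: r[field].strip()} for r in rows]


@register_block("drop_short")
def drop_short(rows: Rows, field: str = "text", min_chars: int = 5) -> Rows:
    # Transform step: keep only rows whose field has at least `min_chars` characters.
    return [r for r in rows if len(r[field]) >= min_chars]


def run_pipeline(config: Dict[str, Any]) -> Rows:
    """Look up each configured block by name and apply the blocks in order."""
    rows: Rows = []
    for step in config["etl"]:
        block = REGISTRY[step["name"]]
        rows = block(rows, **step.get("args", {}))
    return rows


# The pipeline itself is data: reordering or swapping steps needs no new code.
CONFIG = {
    "etl": [
        {"name": "ingest_inline",
         "args": {"records": [{"text": "  hello world  "}, {"text": " hi "}]}},
        {"name": "strip_whitespace"},
        {"name": "drop_short", "args": {"min_chars": 5}},
    ]
}

result = run_pipeline(CONFIG)
print(result)  # [{'text': 'hello world'}]
```

Adding a custom step in this sketch means writing one decorated function and naming it in the configuration, which mirrors the extensibility the summary describes.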

Quick Start & Requirements

Install via pip: pip install dataverse
Prerequisites: Python (3.10-3.11), JDK (version 11), PySpark. Detailed installation guides are available.
Official Docs: https://data-verse.gitbook.io/docs/
Examples: https://github.com/UpstageAI/dataverse/tree/main/examples
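Before installing, it may help to confirm the local environment against the prerequisites listed above. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the Java check only looks for a `java` executable on the PATH without verifying that it is JDK 11.

```python
import shutil
import sys

# From the prerequisites above: Python 3.10 or 3.11.
SUPPORTED_MINORS = (10, 11)


def python_supported(version=None) -> bool:
    """Return True if `version` (default: the running interpreter) is 3.10 or 3.11."""
    major, minor = tuple(version or sys.version_info)[:2]
    return major == 3 and minor in SUPPORTED_MINORS


def java_on_path() -> bool:
    """Return True if a `java` executable is found on PATH (version not checked)."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("java found on PATH:", java_on_path())
```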

Highlighted Details

Supports over 50 registered ETL functions for extraction, transformation (including bias, cleaning, deduplication, PII, quality, toxicity), and loading.
Integrates with AWS S3 for data storage and AWS EMR for distributed pipeline execution.
Offers specific modules for data ingestion from Hugging Face, random sampling, MinHash-based deduplication, and saving to Parquet.
The project is used by Upstage for training models like Solar Mini and for the 1T Token Club initiative.
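To make the MinHash-based deduplication mentioned above more concrete, here is a small pure-Python sketch of the general technique. It is not Dataverse's implementation: the function names, character shingle size, number of hash functions, and threshold are illustrative assumptions, and it compares every document against every kept document, whereas large-scale implementations typically add locality-sensitive hashing to avoid that quadratic cost.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, k: int = 5) -> Set[str]:
    """Lowercase, collapse whitespace, and return the set of k-character shingles."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    # A salted BLAKE2b digest stands in for one member of a hash family.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> Tuple[int, ...]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text, k)
    return tuple(min(_seeded_hash(seed, s) for s in sh) for seed in range(num_perm))


def estimated_jaccard(a: Tuple[int, ...], b: Tuple[int, ...]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(texts: List[str], threshold: float = 0.8, num_perm: int = 64) -> List[str]:
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept: List[str] = []
    kept_sigs: List[Tuple[int, ...]] = []
    for text in texts:
        sig = minhash_signature(text, num_perm)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the LAZY dog.",  # same text after normalization
    "Spark runs distributed jobs on a cluster.",
]
unique_docs = deduplicate(docs)
print(len(unique_docs), "of", len(docs), "documents kept")
```

The second document differs from the first only in case and spacing, so both normalize to the same shingle set and produce identical signatures; the first occurrence is the one that is kept.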

Maintenance & Community

Maintained by the Data-Centric LLM Team at Upstage.
Community support available via Discord: https://discord.gg/aAqF7pyq4h
Paper available: https://arxiv.org/abs/2403.19340

Licensing & Compatibility

Licensed under the Apache-2.0 license.
The license is permissive: it allows commercial use and integration with closed-source projects.

Limitations & Caveats

Some transformation modules, such as 'bias', 'decontamination', and 'toxicity', are marked as Work In Progress (WIP).
Python version support is limited to 3.10 and 3.11.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History: 0 stars in the last 30 days