Code for sourcing and cleaning the BigScience ROOTS corpus
This repository provides the code and tools for sourcing and cleaning the BigScience ROOTS corpus, a large-scale multilingual dataset used for training BLOOM models and their tokenizers. It is intended for researchers and engineers working with large language models who need to understand or replicate the data preparation pipeline for massive datasets.
How It Works
The repository implements a general pipeline for data preparation: sourcing, cleaning, filtering, and deduplication. Dedicated code modules process the OSCAR dataset, handle crowdsourced data, and build the tokenizer training dataset. The linked paper elaborates on the methodology, detailing the specific cleaning and filtering operations.
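The cleaning, filtering, and deduplication stages can be illustrated with a minimal sketch. This is not the repository's actual code; the function names, the word-count threshold, and the hash-based exact-duplicate strategy are illustrative assumptions (the real pipeline applies many more language-specific filters and fuzzy deduplication):

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so near-identical records hash alike
    return re.sub(r"\s+", " ", text.lower()).strip()

def clean_and_dedup(records, min_words=3):
    """Hypothetical filter: drop very short records and exact duplicates
    (after normalization), keeping the first occurrence of each document."""
    seen = set()
    kept = []
    for text in records:
        norm = normalize(text)
        if len(norm.split()) < min_words:
            continue  # length filter: drop near-empty documents
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact-duplicate removal via content hash
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "Hello   world, this is text.",
    "hello world, this is text.",  # duplicate after normalization
    "hi",                          # too short, filtered out
]
print(clean_and_dedup(docs))
```

At ROOTS scale this kind of pass would be distributed (the actual pipeline relies on large-scale batch processing), but the per-record logic of normalize, filter, and hash-deduplicate follows the same shape.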
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with the BigScience initiative, a large collaborative effort. Notable contributors are listed in the citation, indicating a broad community involvement.
Licensing & Compatibility
The repository itself is not explicitly licensed in the provided README. However, the BigScience initiative generally aims for open access. Compatibility for commercial use would depend on the licenses of the underlying datasets used (e.g., OSCAR).
Limitations & Caveats
The README provides no installation instructions or dependency management details, assuming familiarity with data processing pipelines. The scale of the data implies substantial compute and storage requirements that are not documented here.