RedPajama-Data by togethercomputer

Dataset pipeline for training large language models

Created 2 years ago
4,809 stars

Top 10.4% on SourcePulse

View on GitHub
Project Summary

This repository provides the code and methodology for creating the RedPajama-V2 dataset, a 30 trillion token corpus designed for training large language models. It targets researchers and developers building LLMs, offering a comprehensive, open dataset with extensive quality annotations and deduplication.

How It Works

The pipeline processes Common Crawl data through a multi-stage approach: artifact preparation (including classifier training and blacklist fetching), quality signal computation (using CCNet and ML heuristics), and deduplication (exact via Bloom filters and fuzzy via LSH). This modular design allows for flexibility and detailed data analysis, enabling users to select specific quality signals or deduplication levels.
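The exact-deduplication stage described above can be illustrated with a minimal Bloom filter. This is a generic sketch of the technique, not the repository's implementation; the bit-array size and hash count are arbitrary placeholders.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch for exact-duplicate detection.

    Illustrative only: num_bits and num_hashes are placeholder values,
    not the parameters used by the RedPajama pipeline.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, doc_text):
        # Derive num_hashes bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(doc_text.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, doc_text):
        for pos in self._positions(doc_text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, doc_text):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(doc_text))


seen = BloomFilter()
corpus = ["a web page", "another page", "a web page"]  # third is an exact dup
unique = []
for doc in corpus:
    if doc not in seen:
        seen.add(doc)
        unique.append(doc)
```

The appeal at corpus scale is that the filter's memory footprint is fixed up front regardless of how many documents stream through, at the cost of a small, tunable false-positive rate.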

Quick Start & Requirements

  • Installation: Requires Docker and Apptainer. Configuration involves copying and modifying configs/rp_v2.0.conf.
  • Prerequisites: s5cmd for S3 access, plus a Python environment whose dependencies are implied by the scripts rather than listed explicitly. A pre-trained fastText classifier for Wikipedia references is also required.
  • Execution: The pipeline is run via shell scripts (scripts/run_prep_artifacts.sh, scripts/apptainer_run_quality_signals.sh, scripts/apptainer_run_lsh.sh).
  • Resources: For the LSH step, the README cites processing 200M documents on a 64-core machine with 500GB of RAM.
  • Documentation: A blog post and the Hugging Face dataset page are linked for more information.

Highlighted Details

  • Offers 30T tokens across 20.8B documents in multiple languages (English, German, French, Italian, Spanish).
  • Includes over 50 distinct quality signals, categorized into CCNet, ML Heuristics, Natural Language, Repetitiveness, and Toxicity.
  • Implements both exact (Bloom filter) and fuzzy (LSH) deduplication methods.
  • Provides banded MinHash signatures for various Jaccard similarity thresholds (0.7 to 1.0).
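The fuzzy-deduplication details above (MinHash signatures banded for LSH) can be sketched with a toy implementation. The shingle size, permutation count, and banding scheme (32 bands of 4 rows) here are illustrative assumptions, not the repository's settings.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=128):
    """Approximate a set by the minimum of num_perm seeded hash functions."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]


def lsh_candidates(signatures, bands=32, rows=4):
    """Band each signature; documents sharing any band become candidate pairs.

    With b bands of r rows, pairs above roughly Jaccard (1/b)**(1/r)
    (about 0.42 here) tend to collide in at least one band.
    """
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs |= set(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "large language models are trained on massive corpora of text "
         "scraped from the public web and filtered for quality signals",
    "b": "large language models are trained on massive corpora of text "
         "scraped from the public web and filtered for quality metrics",
    "c": "a bloom filter stores membership bits compactly so exact "
         "duplicates can be dropped cheaply",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = lsh_candidates(sigs)  # near-duplicates "a" and "b" collide; "c" does not
```

Candidate pairs from LSH are typically verified against the full signatures (or the documents) before removal, since banding admits some false positives near the threshold.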

Maintenance & Community

The project acknowledges contributions from AI2 (OLMo team), OpenGPT-X, Cerebras, and EleutherAI. It builds upon the work of The Pile and cites numerous academic institutions and research groups as partners for RedPajama-v1.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. The dataset itself is subject to the Common Crawl Foundation Terms of Use. Compatible with commercial use under Apache 2.0 terms.

Limitations & Caveats

The pipeline depends on specific containerization tools (Docker, Apptainer) and external data sources (the LDNOOBW word list and the UT1 blacklist). Running it outside containers requires pinning the PYTHONHASHSEED environment variable so that hash-based steps remain consistent across processes.
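The PYTHONHASHSEED caveat can be checked directly: Python randomizes string hashing per process unless the seed is pinned, so any bucketing built on the built-in hash() is only reproducible across runs when the variable is fixed. This is a generic demonstration, not code from the repository.

```python
import os
import subprocess
import sys

CODE = 'print(hash("redpajama"))'


def hash_across_runs(seed=None):
    """Run a fresh interpreter 3 times and collect the hash() outputs."""
    env = {**os.environ}
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    return {
        subprocess.run([sys.executable, "-c", CODE],
                       env=env, capture_output=True, text=True).stdout.strip()
        for _ in range(3)
    }


pinned = hash_across_runs(seed="0")  # deterministic: a single distinct value
unpinned = hash_across_runs()        # randomized: usually three distinct values
```

Content-based hashes (hashlib, as used in the sketches above) do not have this problem; the caveat applies only to code paths that rely on Python's built-in hash().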

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

cosmopedia by huggingface

Top 0.4% on SourcePulse · 540 stars
Synthetic dataset for LLM training
Created 1 year ago · Updated 10 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 4 more.

dolma by allenai

Top 0.2% on SourcePulse · 1k stars
Toolkit for curating datasets for language model pre-training
Created 2 years ago · Updated 2 days ago
Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

Top 0.9% on SourcePulse · 3k stars
Data processing library for large-scale text data
Created 2 years ago · Updated 2 days ago