RedPajama-Data by togethercomputer

Dataset pipeline for training large language models

created 2 years ago
4,783 stars

Top 10.6% on sourcepulse

Project Summary

This repository provides the code and methodology for creating the RedPajama-V2 dataset, a 30 trillion token corpus designed for training large language models. It targets researchers and developers building LLMs, offering a comprehensive, open dataset with extensive quality annotations and deduplication.

How It Works

The pipeline processes Common Crawl data through a multi-stage approach: artifact preparation (including classifier training and blacklist fetching), quality signal computation (using CCNet and ML heuristics), and deduplication (exact via Bloom filters and fuzzy via LSH). This modular design allows for flexibility and detailed data analysis, enabling users to select specific quality signals or deduplication levels.
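The exact-deduplication step can be pictured with a toy Bloom filter over document hashes. The class below is an illustrative sketch, not the repository's implementation; its sizes (`num_bits`, `num_hashes`) are placeholder values far smaller than a 20.8B-document corpus would need:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for exact-duplicate detection by content hash.
    Illustrative sizes only; production filters are orders of magnitude larger."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes independent bit positions by salting blake2b.
        for i in range(self.num_hashes):
            h = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        # All positions set -> "probably seen"; any unset -> definitely new.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add(b"document body")
```

A Bloom filter never produces false negatives, so a document reported absent is guaranteed new; the trade-off is a tunable false-positive rate that grows as the filter fills.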

Quick Start & Requirements

  • Installation: Requires Docker and Apptainer. Configuration involves copying and modifying configs/rp_v2.0.conf.
  • Prerequisites: s5cmd for S3 access and a Python environment with the dependencies the scripts rely on (no explicit list is given). A pre-trained fastText classifier for Wikipedia references is also needed.
  • Execution: The pipeline is run via shell scripts (scripts/run_prep_artifacts.sh, scripts/apptainer_run_quality_signals.sh, scripts/apptainer_run_lsh.sh).
  • Resources: For the LSH stage, the README cites processing 200M documents on a 64-core machine with 500 GB of RAM.
  • Documentation: Blog post and HuggingFace dataset page are linked for more information.

Highlighted Details

  • Offers 30T tokens across 20.8B documents in multiple languages (English, German, French, Italian, Spanish).
  • Includes over 50 distinct quality signals, categorized into CCNet, ML Heuristics, Natural Language, Repetitiveness, and Toxicity.
  • Implements both exact (Bloom filter) and fuzzy (LSH) deduplication methods.
  • Provides banded MinHash signatures for various Jaccard similarity thresholds (0.7 to 1.0).
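The banded MinHash scheme above can be sketched as follows. The permutation count, band count, and token-hashing choices here are illustrative assumptions, not the pipeline's actual parameters:

```python
import hashlib
import random

def stable_hash(token: str) -> int:
    # Seed-independent 64-bit token hash (unlike built-in hash(),
    # which varies with PYTHONHASHSEED).
    return int.from_bytes(
        hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def minhash_signature(tokens, num_perm=128, seed=0):
    """MinHash signature: num_perm independent minima over the token set."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # Mersenne prime modulus
    perms = [(rng.randrange(1, prime), rng.randrange(prime))
             for _ in range(num_perm)]
    hashes = [stable_hash(t) for t in set(tokens)]
    return [min((a * h + b) % prime for h in hashes) for a, b in perms]

def bands(signature, num_bands=16):
    """Split a signature into bands; documents sharing any whole band
    land in the same LSH bucket and become duplicate candidates."""
    rows = len(signature) // num_bands
    return [tuple(signature[i * rows:(i + 1) * rows])
            for i in range(num_bands)]

a = minhash_signature("the quick brown fox jumps over the lazy dog".split())
b = minhash_signature("the quick brown fox leaps over the lazy dog".split())
# Fraction of matching slots estimates Jaccard similarity (true value 7/9 here).
estimate = sum(x == y for x, y in zip(a, b)) / len(a)
```

Fewer, wider bands push the effective match threshold toward 1.0, while more, narrower bands lower it; varying this banding is how signatures tuned to different Jaccard thresholds (0.7 to 1.0) are produced.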

Maintenance & Community

The project acknowledges contributions from AI2 (OLMo team), OpenGPT-X, Cerebras, and EleutherAI. It builds upon the work of The Pile and cites numerous academic institutions and research groups as partners for RedPajama-v1.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. The dataset itself is subject to the Common Crawl Foundation Terms of Use. Compatible with commercial use under Apache 2.0 terms.

Limitations & Caveats

The pipeline depends on specific containerization tools (Docker and Apptainer) and external data sources (the LDNOOBW word list and the UT1 blacklist). Running it outside containers requires pinning the PYTHONHASHSEED environment variable so that hash-based steps produce consistent results across processes.
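To see why fixing PYTHONHASHSEED matters: CPython salts `str` hashes per interpreter process, so unseeded runs can partition the same data differently. This hypothetical demo compares `hash("redpajama")` across fresh interpreters:

```python
import os
import subprocess
import sys

CODE = 'print(hash("redpajama"))'

def hash_in_subprocess(seed: str) -> str:
    """Evaluate hash("redpajama") in a fresh interpreter with a given seed."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run([sys.executable, "-c", CODE],
                         env=env, capture_output=True, text=True)
    return out.stdout.strip()

# With the same fixed seed, independent interpreter runs agree; with
# randomization left on ("random"), runs usually disagree, which would
# silently change any hash-partitioned intermediate results.
assert hash_in_subprocess("0") == hash_in_subprocess("0")
```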

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 87 stars in the last 90 days
