RedPajama-Data by togethercomputer

Dataset pipeline for training large language models

Created 2 years ago
4,809 stars

Top 10.4% on SourcePulse

View on GitHub
Project Summary

This repository provides the code and methodology for creating the RedPajama-V2 dataset, a 30 trillion token corpus designed for training large language models. It targets researchers and developers building LLMs, offering a comprehensive, open dataset with extensive quality annotations and deduplication.

How It Works

The pipeline processes Common Crawl data through a multi-stage approach: artifact preparation (including classifier training and blacklist fetching), quality signal computation (using CCNet and ML heuristics), and deduplication (exact via Bloom filters and fuzzy via LSH). This modular design allows for flexibility and detailed data analysis, enabling users to select specific quality signals or deduplication levels.
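The exact-deduplication stage described above can be illustrated with a minimal Bloom filter. This is a generic sketch of the technique, not the repository's implementation; the bit-array size and hash count are arbitrary placeholders.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch for exact-duplicate detection.

    Illustrative only: num_bits and num_hashes are placeholder values,
    not the parameters used by the RedPajama pipeline.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, doc_text):
        # Derive num_hashes bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(doc_text.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, doc_text):
        for pos in self._positions(doc_text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, doc_text):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(doc_text))


seen = BloomFilter()
corpus = ["a web page", "another page", "a web page"]  # third is an exact dup
unique = []
for doc in corpus:
    if doc not in seen:
        seen.add(doc)
        unique.append(doc)
```

The appeal at corpus scale is that the filter's memory footprint is fixed up front regardless of how many documents stream through, at the cost of a small, tunable false-positive rate.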

Quick Start & Requirements

  • Installation: Requires Docker and Apptainer. Configuration involves copying and modifying configs/rp_v2.0.conf.
  • Prerequisites: s5cmd for S3 access, plus a Python environment whose dependencies are implied by the scripts rather than listed explicitly. A pre-trained fastText classifier for Wikipedia references is also required.
  • Execution: The pipeline is run via shell scripts (scripts/run_prep_artifacts.sh, scripts/apptainer_run_quality_signals.sh, scripts/apptainer_run_lsh.sh).
  • Resources: For the LSH step, the README cites processing 200M documents on a 64-core machine with 500GB of RAM.
  • Documentation: A blog post and the Hugging Face dataset page are linked for more information.

Highlighted Details

  • Offers 30T tokens across 20.8B documents in multiple languages (English, German, French, Italian, Spanish).
  • Includes over 50 distinct quality signals, categorized into CCNet, ML Heuristics, Natural Language, Repetitiveness, and Toxicity.
  • Implements both exact (Bloom filter) and fuzzy (LSH) deduplication methods.
  • Provides banded MinHash signatures for various Jaccard similarity thresholds (0.7 to 1.0).
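The fuzzy-deduplication details above (MinHash signatures banded for LSH) can be sketched with a toy implementation. The shingle size, permutation count, and banding scheme (32 bands of 4 rows) here are illustrative assumptions, not the repository's settings.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=128):
    """Approximate a set by the minimum of num_perm seeded hash functions."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]


def lsh_candidates(signatures, bands=32, rows=4):
    """Band each signature; documents sharing any band become candidate pairs.

    With b bands of r rows, pairs above roughly Jaccard (1/b)**(1/r)
    (about 0.42 here) tend to collide in at least one band.
    """
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs |= set(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "large language models are trained on massive corpora of text "
         "scraped from the public web and filtered for quality signals",
    "b": "large language models are trained on massive corpora of text "
         "scraped from the public web and filtered for quality metrics",
    "c": "a bloom filter stores membership bits compactly so exact "
         "duplicates can be dropped cheaply",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = lsh_candidates(sigs)  # near-duplicates "a" and "b" collide; "c" does not
```

Candidate pairs from LSH are typically verified against the full signatures (or the documents) before removal, since banding admits some false positives near the threshold.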

Maintenance & Community

The project acknowledges contributions from AI2 (OLMo team), OpenGPT-X, Cerebras, and EleutherAI. It builds upon the work of The Pile and cites numerous academic institutions and research groups as partners for RedPajama-v1.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. The dataset itself is subject to the Common Crawl Foundation Terms of Use. Compatible with commercial use under Apache 2.0 terms.

Limitations & Caveats

The pipeline depends on specific containerization tools (Docker, Apptainer) and external data sources (the LDNOOBW word list and the UT1 blacklist). Running it outside containers requires pinning the PYTHONHASHSEED environment variable so that hash-based steps remain consistent across processes.
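The PYTHONHASHSEED caveat can be checked directly: Python randomizes string hashing per process unless the seed is pinned, so any bucketing built on the built-in hash() is only reproducible across runs when the variable is fixed. This is a generic demonstration, not code from the repository.

```python
import os
import subprocess
import sys

CODE = 'print(hash("redpajama"))'


def hash_across_runs(seed=None):
    """Run a fresh interpreter 3 times and collect the hash() outputs."""
    env = {**os.environ}
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    return {
        subprocess.run([sys.executable, "-c", CODE],
                       env=env, capture_output=True, text=True).stdout.strip()
        for _ in range(3)
    }


pinned = hash_across_runs(seed="0")  # deterministic: a single distinct value
unpinned = hash_across_runs()        # randomized: usually three distinct values
```

Content-based hashes (hashlib, as used in the sketches above) do not have this problem; the caveat applies only to code paths that rely on Python's built-in hash().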

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

cosmopedia by huggingface

Top 0.4% on SourcePulse · 540 stars
Synthetic dataset for LLM training
Created 1 year ago · Updated 10 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 4 more.

dolma by allenai

Top 0.2% on SourcePulse · 1k stars
Toolkit for curating datasets for language model pre-training
Created 2 years ago · Updated 2 days ago
Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

Top 0.9% on SourcePulse · 3k stars
Data processing library for large-scale text data
Created 2 years ago · Updated 2 days ago