Dataset pipeline for training large language models
This repository provides the code and methodology for creating the RedPajama-V2 dataset, a 30 trillion token corpus designed for training large language models. It targets researchers and developers building LLMs, offering a comprehensive, open dataset with extensive quality annotations and deduplication.
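For a first look at the data itself, a small sample can be pulled with the Hugging Face datasets library. This is a hedged sketch: the hub id togethercomputer/RedPajama-Data-V2, the "sample" config, and the field names in the comment are assumptions based on the public dataset card, not something stated in this summary.

```python
# Hypothetical quick look at a small sample of the corpus via the Hugging Face
# `datasets` library. Hub id and config name are assumptions; adjust to the card.
from datasets import load_dataset

sample = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    trust_remote_code=True,  # the dataset ships a loading script
)
print(sample)                     # splits and document counts
print(sample["train"][0].keys())  # e.g. raw_content, quality_signals, ...
```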
How It Works
The pipeline processes Common Crawl data through a multi-stage approach: artifact preparation (including classifier training and blacklist fetching), quality signal computation (using CCNet and ML heuristics), and deduplication (exact via Bloom filters and fuzzy via LSH). This modular design allows for flexibility and detailed data analysis, enabling users to select specific quality signals or deduplication levels.
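To make the fuzzy-deduplication stage concrete, the sketch below shows the general MinHash + LSH pattern using the datasketch library. It is illustrative only: the character-shingle size, number of permutations, and 0.8 Jaccard threshold are assumptions, not the pipeline's actual parameters or code.

```python
# Minimal sketch of fuzzy deduplication with MinHash signatures + LSH.
# Illustrative only -- the real pipeline runs this idea at corpus scale.
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    """Character n-gram shingles used as the set representation of a document."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

# Jaccard threshold above which two documents count as near duplicates (assumed value).
lsh = MinHashLSH(threshold=0.8, num_perm=128)

docs = {
    "doc-0": "The quick brown fox jumps over the lazy dog.",
    "doc-1": "The quick brown fox jumped over the lazy dog!",  # near duplicate of doc-0
    "doc-2": "A completely unrelated piece of text.",
}

kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if lsh.query(m):        # a previously kept document is a likely near duplicate
        continue            # drop this one
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # e.g. ['doc-0', 'doc-2']
```

The threshold is the main knob: lowering it removes more near-duplicates at the cost of discarding documents that are only loosely similar.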
Quick Start & Requirements
Pipeline runs are configured through configs/rp_v2.0.conf. Each stage is launched with a wrapper script: scripts/run_prep_artifacts.sh for artifact preparation, scripts/apptainer_run_quality_signals.sh for quality-signal computation, and scripts/apptainer_run_lsh.sh for LSH-based deduplication.
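Once the quality-signal stage has produced annotated documents, individual signals can be used to filter the corpus. The sketch below is hypothetical: the field names (quality_signals, rps_doc_word_count, ccnet_perplexity), the [[start, end, value]] span encoding, the file names, and the thresholds are assumptions about the output format rather than guarantees of this summary.

```python
# Hypothetical filter over annotated JSONL output of the quality-signal stage.
# File names, field names, and thresholds are illustrative assumptions.
import gzip
import json

def keep(doc):
    sig = doc["quality_signals"]
    word_count = sig["rps_doc_word_count"][0][2]  # value of the whole-document span
    perplexity = sig["ccnet_perplexity"][0][2]    # CCNet language-model perplexity
    return 50 <= word_count <= 100_000 and perplexity < 1_000

with gzip.open("documents.jsonl.gz", "rt") as fin, \
     gzip.open("filtered.jsonl.gz", "wt") as fout:
    for line in fin:
        doc = json.loads(line)
        if keep(doc):
            fout.write(json.dumps(doc) + "\n")
```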
Highlighted Details
Maintenance & Community
The project acknowledges contributions from AI2 (OLMo team), OpenGPT-X, Cerebras, and EleutherAI. It builds upon the work of The Pile and cites numerous academic institutions and research groups as partners for RedPajama-v1.
Licensing & Compatibility
Licensed under the Apache License, Version 2.0. The dataset itself is subject to the Common Crawl Foundation Terms of Use. Compatible with commercial use under Apache 2.0 terms.
Limitations & Caveats
The pipeline depends on specific containerization tools (Docker, Apptainer) and external data sources (the LDNOOBW word list and the UT1 blacklist). Running it outside the containers requires careful management of the PYTHONHASHSEED environment variable so that hash-based steps stay consistent across runs.
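The PYTHONHASHSEED caveat exists because Python randomizes str/bytes hashing per interpreter process; any step that buckets or partitions data with the built-in hash() is only reproducible across processes and machines if the seed is pinned. A minimal demonstration (not the repo's code):

```python
# Demonstrates why PYTHONHASHSEED must be fixed for hash-based steps to be
# reproducible across separate interpreter processes. Illustrative only.
import os
import subprocess
import sys

snippet = "print(hash('redpajama'))"

def run(env):
    out = subprocess.run([sys.executable, "-c", snippet],
                         env=env, capture_output=True, text=True)
    return out.stdout.strip()

seeded = {**os.environ, "PYTHONHASHSEED": "42"}
unseeded = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}

# With a pinned seed, every process computes the same hash ...
assert run(seeded) == run(seeded)
# ... without it, two processes will almost certainly disagree.
print("seeded:", run(seeded), "| unseeded:", run(unseeded), run(unseeded))
```

When running without the containers, the same seed value must be exported for every stage of the pipeline.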
Last activity was roughly 7 months ago; the repository is currently marked inactive.