Web crawler for LLM pretraining research
This repository provides the official implementation for Craw4LLM, a system designed for efficient web crawling specifically tailored for Large Language Model (LLM) pretraining. It targets researchers and engineers working on LLM data pipelines, offering a method to select high-quality documents for training.
How It Works
Craw4LLM employs a document scoring and selection mechanism to prioritize relevant content. It ranks documents with a combination of rating methods, including document length and a fastText classifier trained on specific datasets (e.g., Reddit ELI5 vs. RW). The selection_method parameter in the configuration determines the final ranking criterion, allowing strategies such as ranking by the fastText score or random selection. This approach aims to improve the quality and efficiency of web data collection for LLM pretraining compared to random crawling or simple in-link-count methods.
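The loop below is a minimal sketch of this scoring-and-selection step, assuming hypothetical helper names (rate_length, rate_fasttext, select_docs) and a hypothetical fastText label __label__hq; it illustrates the idea rather than the repository's actual API.

```python
# Minimal sketch of one scoring-and-selection iteration. Helper names and the
# "__label__hq" label are illustrative assumptions, not the repository's API.
import random

def rate_length(doc_text):
    # Toy length-based rating: longer documents score higher.
    return len(doc_text.split())

def rate_fasttext(doc_text, fasttext_model):
    # Probability that the document belongs to the "high quality" class.
    labels, probs = fasttext_model.predict(doc_text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]

def select_docs(candidates, fasttext_model, selection_method, k):
    # candidates: list of (doc_id, doc_text) pairs discovered in this iteration.
    if selection_method == "random":
        return random.sample(candidates, min(k, len(candidates)))
    scored = []
    for doc_id, text in candidates:
        if selection_method == "fasttext_score":
            score = rate_fasttext(text, fasttext_model)
        else:
            score = rate_length(text)
        scored.append((score, doc_id, text))
    scored.sort(reverse=True)  # highest-scoring documents first
    return [(doc_id, text) for _, doc_id, text in scored[:k]]
```

Only the top-k documents from each iteration are kept, and their out-links seed the next iteration.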
Quick Start & Requirements
Dependencies: numpy, tqdm, fasttext, pyyaml, and wandb.
Place the fastText classifier model under fasttext_scorers/.
Run a crawl: python crawl.py crawl --config <path_to_your_config_file>
Configuration files live in configs/ and specify cw22_root_path, seed_docs_file, output_dir, num_selected_docs_per_iter, max_num_docs, selection_method, and rating_methods.
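As a rough illustration of those fields, the snippet below writes a placeholder config with pyyaml; the exact YAML schema (particularly the rating_methods entries) and all paths and values are assumptions, not the repository's documented format.

```python
# Placeholder config covering the fields listed above, written with pyyaml.
# The exact schema (particularly rating_methods) and all values are assumptions.
import os
import yaml

example_config = {
    "cw22_root_path": "/data/clueweb22",      # root of the ClueWeb22 dataset
    "seed_docs_file": "seed_docs.txt",         # document IDs that start the crawl
    "output_dir": "outputs/crawl_run1",        # where selected document IDs go
    "num_selected_docs_per_iter": 10000,       # documents kept per iteration
    "max_num_docs": 20000000,                  # stop after this many documents
    "selection_method": "fasttext_score",      # e.g. "fasttext_score" or "random"
    "rating_methods": [{"type": "length"}, {"type": "fasttext_score"}],
}

os.makedirs("configs", exist_ok=True)
with open("configs/example.yaml", "w") as f:
    yaml.safe_dump(example_config, f, sort_keys=False)
```

A crawl would then be launched with python crawl.py crawl --config configs/example.yaml.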
After the crawl finishes, fetch the texts of the selected documents: python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
To inspect a single document by its ID: python access_data.py <path_to_clueweb22> <document_id>
Highlighted Details
Integrates with Weights & Biases (wandb) for experiment tracking.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The primary requirement is access to the ClueWeb22 dataset, which is substantial. The efficiency of the crawler is heavily dependent on the data being stored on an SSD. The README does not detail the specific fastText model used or its licensing.