Web crawler for LLM pretraining research
This repository provides the official implementation for Craw4LLM, a system designed for efficient web crawling specifically tailored for Large Language Model (LLM) pretraining. It targets researchers and engineers working on LLM data pipelines, offering a method to select high-quality documents for training.
How It Works
Craw4LLM employs a document scoring and selection mechanism to prioritize relevant content. It ranks documents with a combination of rating methods, including document length and a fastText classifier trained on specific datasets (e.g., Reddit ELI5 vs. RW). The selection_method parameter in the configuration determines the final ranking criterion, allowing strategies such as ranking by the fastText score or random selection. This approach aims to improve the quality and efficiency of web data collection for LLM pretraining compared to random crawling or simple in-link-count methods.
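The loop below is a minimal sketch of this scoring-and-selection step, assuming hypothetical helper names (rate_length, rate_fasttext, select_docs) and a hypothetical fastText label __label__hq; it illustrates the idea rather than the repository's actual API.

```python
# Minimal sketch of one scoring-and-selection iteration. Helper names and the
# "__label__hq" label are illustrative assumptions, not the repository's API.
import random

def rate_length(doc_text):
    # Toy length-based rating: longer documents score higher.
    return len(doc_text.split())

def rate_fasttext(doc_text, fasttext_model):
    # Probability that the document belongs to the "high quality" class.
    labels, probs = fasttext_model.predict(doc_text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]

def select_docs(candidates, fasttext_model, selection_method, k):
    # candidates: list of (doc_id, doc_text) pairs discovered in this iteration.
    if selection_method == "random":
        return random.sample(candidates, min(k, len(candidates)))
    scored = []
    for doc_id, text in candidates:
        if selection_method == "fasttext_score":
            score = rate_fasttext(text, fasttext_model)
        else:
            score = rate_length(text)
        scored.append((score, doc_id, text))
    scored.sort(reverse=True)  # highest-scoring documents first
    return [(doc_id, text) for _, doc_id, text in scored[:k]]
```

Only the top-k documents from each iteration are kept, and their out-links seed the next iteration.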
Quick Start & Requirements
Dependencies: numpy, tqdm, fasttext, pyyaml, and wandb.
Place the fastText classifier model under fasttext_scorers/.
Run a crawl: python crawl.py crawl --config <path_to_your_config_file>
Configuration files live in configs/ and specify cw22_root_path, seed_docs_file, output_dir, num_selected_docs_per_iter, max_num_docs, selection_method, and rating_methods.
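As a rough illustration of those fields, the snippet below writes a placeholder config with pyyaml; the exact YAML schema (particularly the rating_methods entries) and all paths and values are assumptions, not the repository's documented format.

```python
# Placeholder config covering the fields listed above, written with pyyaml.
# The exact schema (particularly rating_methods) and all values are assumptions.
import os
import yaml

example_config = {
    "cw22_root_path": "/data/clueweb22",      # root of the ClueWeb22 dataset
    "seed_docs_file": "seed_docs.txt",         # document IDs that start the crawl
    "output_dir": "outputs/crawl_run1",        # where selected document IDs go
    "num_selected_docs_per_iter": 10000,       # documents kept per iteration
    "max_num_docs": 20000000,                  # stop after this many documents
    "selection_method": "fasttext_score",      # e.g. "fasttext_score" or "random"
    "rating_methods": [{"type": "length"}, {"type": "fasttext_score"}],
}

os.makedirs("configs", exist_ok=True)
with open("configs/example.yaml", "w") as f:
    yaml.safe_dump(example_config, f, sort_keys=False)
```

A crawl would then be launched with python crawl.py crawl --config configs/example.yaml.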
After the crawl finishes, fetch the texts of the selected documents: python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
To inspect a single document by its ID: python access_data.py <path_to_clueweb22> <document_id>
Highlighted Details
Integrates with Weights & Biases (wandb) for experiment tracking.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The primary requirement is access to the ClueWeb22 dataset, which is substantial. The efficiency of the crawler is heavily dependent on the data being stored on an SSD. The README does not detail the specific fastText model used or its licensing.