Craw4LLM by cxcscmu

Web crawler for LLM pretraining research

created 5 months ago
634 stars

Top 53.3% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation of Craw4LLM, a system for efficient web crawling tailored to Large Language Model (LLM) pretraining. It targets researchers and engineers building LLM data pipelines, offering a method for selecting high-quality documents for training.

How It Works

Craw4LLM employs a document scoring and selection mechanism to prioritize relevant content. It combines several scoring methods, including document length and a DCLM fastText classifier (trained to distinguish instruction-style text, e.g. Reddit ELI5, from RefinedWeb data), to rank candidate documents. The selection_method parameter in the configuration determines the final ranking criterion, allowing strategies such as the fastText score or random selection. This approach aims to improve the quality and efficiency of web data collection for LLM pretraining compared to random selection or simple in-link counts.
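The score-and-select loop described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `length_score` rater and `select_docs` helper are hypothetical placeholders standing in for the configured combination of raters (document length plus the DCLM fastText classifier).

```python
import heapq

def length_score(doc_text: str) -> float:
    # Placeholder rater: the real system combines raters such as a
    # DCLM fastText classifier; here we just use normalized length.
    return min(len(doc_text) / 10_000, 1.0)

def select_docs(frontier: dict[str, str], k: int) -> list[str]:
    # Rank every candidate document in the crawl frontier by its score
    # and keep the top-k for the next iteration (mirrors the
    # num_selected_docs_per_iter config parameter).
    scored = [(length_score(text), doc_id) for doc_id, text in frontier.items()]
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

frontier = {
    "doc-a": "short",
    "doc-b": "x" * 5000,
    "doc-c": "x" * 9000,
}
print(select_docs(frontier, 2))  # → ['doc-c', 'doc-b']
```

Swapping the scorer (fastText, random, in-link count) while keeping the same top-k selection loop is what the selection_method parameter controls.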

Quick Start & Requirements

  • Install: Create a Python >= 3.10 virtual environment and install requirements: numpy, tqdm, fasttext, pyyaml, wandb.
  • Prerequisites: Requires the ClueWeb22 dataset, which should ideally be stored on an SSD for efficient crawling. Download the DCLM fastText classifier model to fasttext_scorers/.
  • Run Crawler: python crawl.py crawl --config <path_to_your_config_file>
  • Configuration: Create YAML files in configs/ specifying cw22_root_path, seed_docs_file, output_dir, num_selected_docs_per_iter, max_num_docs, selection_method, and rating_methods.
  • Fetch Documents: python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
  • Access Data: python access_data.py <path_to_clueweb22> <document_id>
  • Documentation: no dedicated documentation is linked; the configuration examples in configs/ serve as the main reference.
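Assuming the field names listed above, a config file might look like the following sketch. All paths and values are illustrative, and the fastText model filename is a placeholder:

```yaml
# Hypothetical config sketch; field names follow the README, values are illustrative.
cw22_root_path: /ssd/clueweb22
seed_docs_file: seed_docs.txt
output_dir: crawl_results/my_run
num_selected_docs_per_iter: 10000
max_num_docs: 20000000
selection_method: fasttext_score
rating_methods:
  - type: length
  - type: fasttext_score
    model_path: fasttext_scorers/dclm_model.bin  # placeholder filename
```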

Highlighted Details

  • Supports multiple document selection strategies: DCLM fastText score, random score, and in-link count.
  • Integrates with Weights & Biases (wandb) for experiment tracking.
  • Allows configuration of scoring methods, including custom raters.
  • Provides utilities to fetch document texts and access individual documents by ID.
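As an illustration of the custom-rater idea, here is a hypothetical in-link-count rater. The repository's actual rater base class, method names, and registration mechanism may differ; this only shows the shape such a component could take.

```python
class InLinkCountRater:
    """Illustrative custom rater (hypothetical interface; the repository's
    actual rater classes and method signatures may differ)."""

    name = "inlink_count"

    def __init__(self, inlink_counts: dict[str, int]):
        # Map of document ID -> number of pages linking to it.
        self.inlink_counts = inlink_counts

    def rate(self, doc_id: str) -> float:
        # Score a document by its in-link count, one of the baseline
        # strategies the README lists alongside fastText and random.
        return float(self.inlink_counts.get(doc_id, 0))

rater = InLinkCountRater({"doc-a": 12, "doc-b": 3})
print(rater.rate("doc-a"))  # → 12.0
```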

Maintenance & Community

  • The repository is officially maintained by cxcscmu.
  • No specific community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying datasets and models used may have their own licenses.

Limitations & Caveats

The primary requirement is access to the ClueWeb22 dataset, which is substantial and must be obtained separately. Crawler efficiency depends heavily on the dataset being stored on an SSD. The README does not detail the licensing of the DCLM fastText classifier it relies on.

Health Check
Last commit

5 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
21 stars in the last 90 days
