openwebtext by jcpeterson

WebText dataset recreation for training GPT-2-like models

Created 6 years ago

750 stars

Top 46.4% on SourcePulse

1 Expert Loves This Project

Bryan McCann

Cofounder of You.com

Project Summary

This project provides an open-source pipeline for scraping and processing web pages to build OpenWebText, a large-scale text corpus modeled on OpenAI's WebText. It is aimed at researchers and developers who need a diverse, high-quality dataset for training large language models. Its main benefit is a reproducible, accessible alternative to OpenAI's proprietary dataset.

How It Works

The scraper works from the monthly pushshift.io dumps of Reddit submissions. It keeps only the URLs from submissions that meet a karma threshold (+3 by default), using karma as a proxy for content quality. The pipeline then runs in four steps: extract the URLs, de-duplicate them, download the raw HTML, and extract plain text from it. Working from bulk dumps instead of making direct API calls keeps the process fast, and parallel processing and compression let it handle large volumes of data.
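The filtering and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes each dump line is a JSON object with url and score fields, and the function name extract_urls and its min_karma parameter are hypothetical.

```python
import json


def extract_urls(lines, min_karma=3):
    """Return unique URLs from submissions scoring at least min_karma.

    Assumes one JSON object per line with "url" and "score" keys
    (an assumption about the dump layout, not taken from the README).
    """
    seen = set()
    urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting the run
        url = post.get("url")
        score = post.get("score") or 0
        if not url or score < min_karma:
            continue
        if url in seen:
            continue  # de-duplicate while preserving first-seen order
        seen.add(url)
        urls.append(url)
    return urls
```

Because the function accepts any iterable of lines, it can be fed a decompressed file object directly, so a whole monthly dump never has to sit in memory at once.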

Quick Start & Requirements

Install dependencies with either pipenv install followed by pipenv shell, or pip3 install -r requirements.txt.
Download the pushshift.io dumps, or start from the pre-filtered URL lists (2GB).
Scraping requires significant compute and bandwidth; parallel processing via the --n_procs option is recommended.
Official quick-start and documentation are available within the repository.

Highlighted Details

Replicates OpenAI's WebText dataset using pushshift.io data.
Processes over 23 million URLs and 10 million HTML pages.
Offers options for scraping raw HTML or extracting text directly using newspaper or Beautiful Soup 4.
Includes utilities for URL extraction, de-duplication, and text tokenization.
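To make the text-extraction step concrete, here is a small sketch of turning downloaded HTML into plain text. The project itself relies on newspaper or Beautiful Soup 4; this example deliberately swaps in Python's standard-library html.parser so that it runs without extra dependencies. The names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if not self._skip_depth:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A dedicated extractor such as newspaper additionally tries to isolate the main article body, which this stand-in does not attempt.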

Maintenance & Community

The project is functional but in active development, welcoming issues and pull requests. Links to community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

Some features, such as BPE encoding, are still pending. The README also notes that the downloader can hang if timeouts are not set.
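The hanging behaviour is the usual consequence of opening connections with no timeout. A minimal sketch of a bounded fetch, assuming the standard-library urllib is used in place of whatever HTTP client the project uses; fetch_html and its 30-second default are hypothetical.

```python
import urllib.request


def fetch_html(url, timeout=30.0):
    """Fetch a URL and return its body as bytes, or None on failure.

    The timeout bounds blocking socket operations so that one
    unresponsive server cannot stall the whole download run.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        # OSError covers URLError and timeouts; ValueError covers malformed URLs.
        return None
```

Returning None instead of raising lets a batch job record the failure and move on to the next URL.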

Explore Similar Projects

llm-reader by m92vyas

scrape-it-now by clemlesne

mcp by hyperbrowserai

openwebtext by yet-another-account

tavily-python by tavily-ai

onefilellm by jimmc414

CyberScraper-2077 by itsOwen

AnyCrawl by any4ai

lazynlp by chiphuyen

trafilatura by adbar

crawlee-python by apify

Scrapegraph-ai by ScrapeGraphAI