openwebtext by jcpeterson

WebText dataset recreation for training GPT-2-like models

created 6 years ago
735 stars

Top 48.1% on sourcepulse

Project Summary

This project is an open-source implementation for scraping and processing OpenWebText, a large-scale text corpus modeled on OpenAI's WebText. It targets researchers and developers who need a diverse, quality-filtered dataset for training large language models. Its main benefit is a reproducible, accessible alternative to OpenAI's proprietary dataset.

How It Works

The scraper reads monthly pushshift.io dumps of Reddit submissions instead of making direct API calls, which favors speed when processing in bulk. It keeps URLs from submissions that meet a karma threshold (+3 by default) as a proxy for content quality. The pipeline then extracts the URLs, de-duplicates them, downloads the raw HTML, and extracts the plain text. Parallel processing and compression let it handle large volumes of data.
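The filtering and de-duplication steps can be illustrated with a minimal sketch. It assumes each dump line is one JSON object with url, score, and is_self fields, as in Reddit submission records; the function names filter_urls and dedupe are hypothetical and are not taken from the repository.

```python
import json
from typing import Iterable, Iterator, List


def filter_urls(lines: Iterable[str], min_karma: int = 3) -> Iterator[str]:
    """Yield outbound URLs from submissions whose score meets the threshold.

    Assumes newline-delimited JSON with "url", "score" and "is_self" keys.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting the run
        if not isinstance(post, dict) or post.get("is_self"):
            continue  # self posts carry no external link
        url = post.get("url")
        score = post.get("score", 0)
        if url and isinstance(score, int) and score >= min_karma:
            yield url


def dedupe(urls: Iterable[str]) -> List[str]:
    """Drop exact duplicate URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

A call such as dedupe(filter_urls(open_file)) would stream one dump file through both steps. Exact string matching is the simplest possible de-duplication; the repository may normalize URLs differently.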

Quick Start & Requirements

  • Install dependencies with either pipenv install followed by pipenv shell, or pip3 install -r requirements.txt.
  • Download the pushshift.io dumps, or start from the pre-filtered URL lists (2GB).
  • Scraping requires significant compute and bandwidth; running it in parallel (--n_procs) is recommended.
  • Official quick-start and documentation are available within the repository.
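To show why parallelism helps with a download-bound job, here is a rough sketch that fans a URL list out over a pool of workers. It uses threads from the standard library rather than separate processes, so it is not how the repository implements --n_procs; fetch_all, its fetch argument, and n_workers are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    n_workers: int = 8,
) -> List[Tuple[str, Optional[bytes]]]:
    """Run fetch over urls concurrently and return (url, body) pairs in input order.

    A failing fetch is recorded as None so one bad URL does not stop the batch.
    """

    def safe_fetch(url: str) -> Optional[bytes]:
        try:
            return fetch(url)
        except Exception:
            return None

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map preserves the order of its input.
        bodies = list(pool.map(safe_fetch, url_list))
    return list(zip(url_list, bodies))
```

Threads suit this sketch because the work is mostly waiting on the network; CPU-heavy steps such as text extraction would benefit more from processes.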

Highlighted Details

  • Replicates OpenAI's WebText dataset using pushshift.io data.
  • Processes over 23 million URLs and 10 million HTML pages.
  • Offers options for scraping raw HTML or extracting text directly using newspaper or Beautiful Soup 4.
  • Includes utilities for URL extraction, de-duplication, and text tokenization.
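As an illustration of the text-extraction step, the sketch below strips markup from a downloaded page. It deliberately uses Python's built-in html.parser instead of newspaper or Beautiful Soup 4, so it is a stand-in for the idea and not the repository's extractor; TextExtractor and html_to_text are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script, style and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This keeps navigation menus, footers, and other page chrome, which dedicated article extractors try to remove.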

Maintenance & Community

The README describes the project as functional but still in active development, and it welcomes issues and pull requests. It does not link to community channels or a roadmap.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The README describes the project as still in development, with features such as BPE encoding pending. It also warns that the downloader can hang if timeouts are not set.
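The hanging issue is the usual consequence of a request with no timeout: a server that accepts the connection and then stalls can block a worker indefinitely. A minimal sketch of a bounded download using only the standard library follows; the fetch function and its timeout and max_bytes parameters are hypothetical and do not mirror the repository's downloader.

```python
import http.client
import urllib.request
from typing import Optional


def fetch(url: str, timeout: float = 10.0, max_bytes: int = 5_000_000) -> Optional[bytes]:
    """Download at most max_bytes from url, returning None on any failure.

    The timeout applies to each blocking socket operation, so it bounds
    stalls but is not a hard cap on the total download time.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(max_bytes)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers strings that are not valid URLs.
        return None
```

Returning None on failure lets a batch job log and skip a bad URL instead of stopping.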

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days
