WebText dataset recreation for training GPT-2-like models
This project provides an open-source implementation for scraping and processing the OpenWebText dataset, a large-scale text corpus inspired by OpenAI's WebText. It's designed for researchers and developers working on large language models who need a high-quality, diverse dataset for training. The primary benefit is providing a reproducible and accessible alternative to OpenAI's proprietary dataset.
How It Works
The scraper leverages monthly dumps of Reddit submissions from pushshift.io. Submissions are filtered by a karma threshold (default +3) as a rough proxy for content quality. The pipeline then extracts the submitted URLs, de-duplicates them, downloads the raw HTML, and extracts plain-text content. Working from dumps rather than direct API calls prioritizes speed and efficiency, with parallel processing and efficient compression handling the large data volumes.
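The filter-and-de-duplicate step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the input format, and the example data are assumptions; only the default +3 karma threshold comes from the README.

```python
# Illustrative sketch of the filtering stage described above.
MIN_KARMA = 3  # default karma threshold from the README

def filter_and_dedup(submissions):
    """Keep URLs from submissions at or above the karma threshold,
    dropping exact duplicates while preserving first-seen order."""
    seen = set()
    urls = []
    for sub in submissions:
        if sub["score"] < MIN_KARMA:
            continue
        url = sub["url"]
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls

# Hypothetical submission records, shaped loosely like pushshift entries.
subs = [
    {"url": "https://example.com/a", "score": 5},
    {"url": "https://example.com/a", "score": 7},  # duplicate URL
    {"url": "https://example.com/b", "score": 1},  # below threshold
    {"url": "https://example.com/c", "score": 3},
]
print(filter_and_dedup(subs))
```

Keeping de-duplication as a simple set lookup works because only exact URL matches need to be dropped at this stage; fuzzy near-duplicate detection would happen later, on the extracted text.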
Quick Start & Requirements
Install dependencies with pipenv install followed by pipenv shell, or with pip3 install -r requirements.txt. Parallel processing (via --n_procs) is recommended.
Highlighted Details
Plain-text extraction from downloaded HTML uses the newspaper library or Beautiful Soup 4.
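To illustrate what that extraction step does, here is a stdlib-only sketch that strips markup down to plain text. The project itself uses newspaper or Beautiful Soup 4; this parser, its class name, and the sample HTML are stand-ins chosen to keep the example dependency-free.

```python
# Stdlib-only stand-in for the HTML-to-plain-text extraction step.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # tags whose text content we discard

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only, skipping script/style bodies.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello, <b>world</b>.</p></body></html>")
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))
```

Libraries like newspaper go well beyond this, handling boilerplate removal, article-body detection, and encoding quirks, which is why the project relies on them rather than raw parsing.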
Maintenance & Community
The project is functional but in active development, welcoming issues and pull requests. Links to community channels or roadmaps are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Users should verify licensing for commercial use or integration with closed-source projects.
Limitations & Caveats
The project is in active development, and features like BPE encoding are still pending. The README mentions potential hanging issues with the downloader if timeouts are not set.
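The hanging caveat comes down to network calls without a deadline. The sketch below shows the general pattern with a stdlib fetcher that always passes a timeout, demonstrated against a local socket that accepts a connection but never responds; fetch() is illustrative, not the project's actual downloader.

```python
# Demonstrates why downloaders hang without timeouts, and the fix:
# always pass a timeout so a stalled fetch raises instead of blocking.
import socket
import threading
import urllib.error
import urllib.request

def fetch(url, timeout=10.0):
    """Download raw HTML, failing fast instead of hanging forever."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

# Local server that accepts the connection but never sends a response,
# mimicking a stalled remote host.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
held = []

def accept_and_hold():
    conn, _ = server.accept()
    held.append(conn)  # keep the socket open so the client stalls

threading.Thread(target=accept_and_hold, daemon=True).start()

try:
    fetch(f"http://127.0.0.1:{port}/", timeout=0.5)
    result = "no timeout"
except OSError:  # covers socket.timeout and urllib.error.URLError
    result = "timed out"
print(result)
```

Without the timeout argument, the same call would block indefinitely, which matches the hanging behavior the README warns about.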