openwebtext by yet-another-account

Dataset clone of GPT-2 WebText for language model training

Created 6 years ago

392 stars

Top 73.4% on SourcePulse

2 Experts Love This Project

Tom Brown

Cofounder of Anthropic

Anastasis Germanidis

Cofounder of Runway

Project Summary

This project aims to replicate OpenAI's GPT-2 WebText dataset, providing an open-source alternative for researchers and developers working with large language models. It offers a community-driven approach to dataset creation, addressing the need for accessible, high-quality training data.

How It Works

The project follows the methodology described in OpenAI's GPT-2 paper, focusing on extracting text from web pages linked from Reddit submissions with at least 3 karma. It leverages a Python-based pipeline that first scrapes URLs from Reddit and then downloads and processes the content from these URLs.
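The URL selection step described above can be sketched as follows. This is a minimal illustration, not the project's actual get_urls.py: the function name select_urls, the dictionary keys "url" and "score", and the de-duplication of repeated URLs are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

MIN_KARMA = 3  # threshold taken from the summary above ("at least 3 karma")


def select_urls(submissions: Iterable[Dict], min_karma: int = MIN_KARMA) -> List[str]:
    """Return unique outbound URLs from submissions that meet the karma threshold.

    Each submission is assumed to be a dict with a "url" and a "score" key;
    these key names are hypothetical and chosen only for this sketch.
    """
    seen = set()
    urls = []
    for sub in submissions:
        url = sub.get("url")
        # Skip submissions without a link or below the karma threshold.
        if not url or sub.get("score", 0) < min_karma:
            continue
        # Keep only the first occurrence of each URL (an assumption of this sketch).
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/article", "score": 5},
        {"url": "https://example.org/post", "score": 2},
        {"url": "https://example.com/article", "score": 10},
    ]
    print(select_urls(sample))
```

Only the first sample URL passes: the second falls below the threshold and the third is a repeat of the first.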

Quick Start & Requirements

Install dependencies: pipenv install
Get URLs: pipenv run python get_urls.py
Download data: pipenv run python download.py
Prerequisites: Pipenv, Python 3, Newspaper library.
Ubuntu: sudo apt-get install libxml2-dev libxslt-dev
OS X: brew install libxml2 libxslt

Highlighted Details

Replicates OpenAI's GPT-2 WebText dataset methodology.
Extracts text from URLs linked in Reddit submissions with at least 3 karma.
Outputs data in {domain}-{sha256 hash of url}.txt format.
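The output naming scheme in the last item can be illustrated with a short sketch. The helper name output_filename is hypothetical, and several details are assumptions not confirmed by the summary: that the domain is the URL's network location, that the URL is encoded as UTF-8 before hashing, and that the hash is written as a hexadecimal digest.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a name of the form {domain}-{sha256 hash of url}.txt.

    Assumptions for this sketch: the domain is the URL's netloc, and the
    hash is the hex digest of the UTF-8 encoded URL string.
    """
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    print(output_filename("https://example.com/some/article"))
```

Hashing the full URL keeps file names unique for different pages on the same domain, while the domain prefix keeps the files easy to group by source.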

Maintenance & Community

The README describes the project as "still heavily WIP" (work in progress). It also thanks jcpeterson for the download code.

Licensing & Compatibility

The README does not state a license. Although the project describes itself as an "open clone," users should verify the licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

The README describes the project as "still heavily WIP," so users should expect instability, incomplete features, and breaking changes. The size and quality of the resulting dataset are not documented.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days