openwebtext  by yet-another-account

Dataset clone of GPT-2 WebText for language model training

created 6 years ago
390 stars

Top 74.7% on sourcepulse

Project Summary

This project aims to replicate OpenAI's GPT-2 WebText dataset, providing an open-source alternative for researchers and developers working with large language models. It offers a community-driven approach to dataset creation, addressing the need for accessible, high-quality training data.

How It Works

The project follows the methodology described in OpenAI's GPT-2 paper: it extracts text from web pages linked from Reddit submissions with at least 3 karma. A Python pipeline first collects the URLs from Reddit (get_urls.py), then downloads and processes the content behind those URLs (download.py).
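
The karma filter in the first stage can be sketched roughly as follows. This is a minimal illustration, not the project's actual get_urls.py: the select_urls helper, the "url" and "score" field names, and the de-duplication step are all assumptions made for the example.

```python
from typing import Iterable, List

MIN_KARMA = 3  # threshold stated in the summary above


def select_urls(submissions: Iterable[dict], min_karma: int = MIN_KARMA) -> List[str]:
    """Return outbound URLs from submissions that meet the karma threshold.

    Hypothetical helper: each submission is assumed to be a dict with
    "url" and "score" keys. Duplicate URLs are kept only once (an assumption).
    """
    seen = set()
    urls = []
    for sub in submissions:
        url = sub.get("url")
        # Skip entries without a link or below the threshold.
        if not url or sub.get("score", 0) < min_karma:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Under these assumptions, a submission with a score of 2 is dropped, while one with a score of 3 or more contributes its URL once, however often it was posted.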

Quick Start & Requirements

  • Install dependencies: pipenv install
  • Get URLs: pipenv run python get_urls.py
  • Download data: pipenv run python download.py
  • Prerequisites: Pipenv, Python 3, and the Newspaper library.
  • Ubuntu system packages: sudo apt-get install libxml2-dev libxslt-dev
  • OS X system packages: brew install libxml2 libxslt

Highlighted Details

  • Replicates OpenAI's GPT-2 WebText dataset methodology.
  • Extracts data from Reddit-linked URLs with specific karma thresholds.
  • Outputs data in {domain}-{sha256 hash of url}.txt format.
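
The output naming scheme in the last item could be built along these lines. This is a sketch under stated assumptions, not the project's code: output_filename is a hypothetical helper, "domain" is taken to be the URL's network location, and the hash is taken to be the hex SHA-256 digest of the UTF-8 encoded URL.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a {domain}-{sha256 hash of url}.txt name for a URL.

    Assumptions: domain = netloc part of the URL; hash = hex digest
    of the URL string encoded as UTF-8.
    """
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"
```

Hashing the full URL keeps file names unique per page while the domain prefix keeps files from the same site grouped together when sorted.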

Maintenance & Community

The README describes the project as a work in progress. It thanks jcpeterson for the download code, which indicates that part of the pipeline builds on that contributor's work.

Licensing & Compatibility

The README does not explicitly state a license. Given the project's goal to be an "open clone," users should verify licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The README explicitly describes the project as "still heavily WIP," so users should expect instability, incomplete features, and breaking changes. The README does not state the size or quality of the resulting dataset.

Health Check
Last commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History
1 star in the last 90 days
