openwebtext by yet-another-account

Dataset clone of GPT-2 WebText for language model training

Created 6 years ago
391 stars

Top 73.5% on SourcePulse

Project Summary

This project aims to replicate OpenAI's GPT-2 WebText dataset, providing an open-source alternative for researchers and developers working with large language models. It offers a community-driven approach to dataset creation, addressing the need for accessible, high-quality training data.

How It Works

The project follows the methodology described in OpenAI's GPT-2 paper: it extracts text from web pages linked in Reddit submissions that received at least 3 karma. A Python pipeline first collects the URLs from Reddit, then downloads and processes the content at each URL.
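The karma filter described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `filter_urls` name, the dict-shaped submissions, and the `url` and `score` field names are all assumptions, and the de-duplication step is an added convenience that the summary does not mention.

```python
from typing import Iterable, Iterator


def filter_urls(submissions: Iterable[dict], min_karma: int = 3) -> Iterator[str]:
    """Yield each distinct outbound URL from submissions with enough karma.

    Assumption: every submission is a dict with a "url" and a "score" key.
    """
    seen = set()
    for sub in submissions:
        url = sub.get("url")
        score = sub.get("score", 0)
        # Skip submissions without a link or below the karma threshold.
        if not url or score < min_karma:
            continue
        # Skip URLs already emitted (assumed de-duplication step).
        if url in seen:
            continue
        seen.add(url)
        yield url


# Hypothetical sample data for illustration only.
sample = [
    {"url": "https://example.com/a", "score": 5},
    {"url": "https://example.com/b", "score": 2},
    {"url": "https://example.com/a", "score": 10},
    {"url": "https://example.org/c", "score": 3},
]
urls = list(filter_urls(sample))
print(urls)
```

With the sample above, only the links with a score of 3 or more survive, and the repeated link is kept once.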

Quick Start & Requirements

  • Install dependencies: pipenv install
  • Get URLs: pipenv run python get_urls.py
  • Download data: pipenv run python download.py
  • Prerequisites: Pipenv, Python 3, and the Newspaper library. Newspaper needs the libxml2 and libxslt system libraries. On Ubuntu: sudo apt-get install libxml2-dev libxslt-dev. On OS X: brew install libxml2 libxslt.

Highlighted Details

  • Replicates OpenAI's GPT-2 WebText dataset methodology.
  • Extracts text from URLs linked in Reddit submissions with at least 3 karma.
  • Outputs data in {domain}-{sha256 hash of url}.txt format.
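
The output naming scheme in the last item can be illustrated with a short sketch. The `output_filename` helper is hypothetical, and two details are assumptions rather than facts from the README: that the domain is the URL's network location, and that the hash is the hex digest of the UTF-8 encoded URL.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a name of the form {domain}-{sha256 hash of url}.txt.

    Assumptions: "domain" is the URL's netloc, and the hash is the
    hexadecimal SHA-256 digest of the UTF-8 encoded URL string.
    """
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


name = output_filename("https://example.com/some/article")
print(name)
```

Hashing the full URL keeps file names unique for different pages on the same domain while staying a fixed length.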

Maintenance & Community

The README describes the project as a work in progress and credits jcpeterson for the download code.

Licensing & Compatibility

The README does not explicitly state a license. Given the project's goal to be an "open clone," users should verify licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The project is explicitly stated as "still heavily WIP," indicating potential instability, incomplete features, or ongoing development that may lead to breaking changes. The exact size and quality of the dataset are not detailed.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days

Explore Similar Projects

trafilatura by adbar

Python package for web text extraction

Created 6 years ago
Updated 6 days ago
5k stars

Top 0.5% on SourcePulse

Starred by Li Jiang (coauthor of AutoGen; engineer at Microsoft), Jeremy Howard (cofounder of fast.ai), and 2 more.