Dataset clone of GPT-2 WebText for language model training
This project aims to replicate OpenAI's GPT-2 WebText dataset, providing an open-source alternative for researchers and developers working with large language models. It offers a community-driven approach to dataset creation, addressing the need for accessible, high-quality training data.
How It Works
The project follows the methodology described in OpenAI's GPT-2 paper, focusing on extracting text from web pages linked from Reddit submissions with at least 3 karma. It leverages a Python-based pipeline that first scrapes URLs from Reddit and then downloads and processes the content from these URLs.
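As a hedged illustration of the URL-selection step described above, the sketch below filters submissions by the 3-karma threshold; the data source, field names, and helper name are assumptions for the example, not the project's actual get_urls.py code.

MIN_KARMA = 3  # threshold described in the GPT-2 paper

def select_urls(submissions):
    """Yield outbound links from Reddit submissions with at least 3 karma.

    `submissions` is assumed to be an iterable of dicts with `score` and
    `url` fields, e.g. parsed from a public Reddit data dump.
    """
    seen = set()
    for post in submissions:
        if post.get("score", 0) < MIN_KARMA:
            continue  # below the karma threshold used as a quality filter
        url = post.get("url", "")
        if not url or "reddit.com" in url:
            continue  # skip self-posts and internal Reddit links
        if url not in seen:
            seen.add(url)
            yield url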
Quick Start & Requirements
pipenv install
pipenv run python get_urls.py
pipenv run python download.py
System dependencies:
Linux: sudo apt-get install libxml2-dev libxslt-dev
OS X: brew install libxml2 libxslt
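The libxml2/libxslt packages are the native build dependencies of lxml, which suggests the pipeline parses HTML with it. The sketch below shows one plausible extraction step; it is an assumption for illustration, not the confirmed contents of download.py.

import lxml.html

def extract_text(html: str) -> str:
    """Strip markup and return the visible text of a fetched page."""
    tree = lxml.html.fromstring(html)
    # Remove script and style nodes so only readable text remains.
    for node in tree.xpath("//script | //style"):
        node.getparent().remove(node)
    return tree.text_content().strip()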
Highlighted Details
Each downloaded page is saved as a plain-text file named in {domain}-{sha256 hash of url}.txt format.
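A small hypothetical helper illustrates how such a filename can be derived from a URL (the function name and layout are for illustration only, not taken from the project):

import hashlib
from urllib.parse import urlparse

def output_filename(url: str) -> str:
    """Build a {domain}-{sha256 hash of url}.txt name for a downloaded page."""
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"

For example, output_filename("https://example.com/post") yields "example.com-" followed by a 64-character hex digest and the ".txt" suffix.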
Maintenance & Community
This project is marked as "still WIP" (work in progress). The README credits jcpeterson for the download code, indicating that parts of the pipeline build on existing community work.
Licensing & Compatibility
The README does not explicitly state a license. Given the project's goal to be an "open clone," users should verify licensing before commercial use or integration into closed-source projects.
Limitations & Caveats
The project is explicitly stated as "still heavily WIP," indicating potential instability, incomplete features, or ongoing development that may lead to breaking changes. The exact size and quality of the dataset are not detailed.
Last activity: about 1 year ago; the repository is marked inactive.