Web scraping library for creating massive datasets
This library provides tools for crawling, cleaning, and deduplicating web pages to construct large monolingual text datasets, aimed at researchers and developers building large language models. By automating the acquisition and cleaning of web content, it enables the creation of datasets potentially larger than OpenAI's WebText.
How It Works
The library runs a multi-stage pipeline. It first acquires URLs from sources such as Reddit dumps and Project Gutenberg, deduplicating them to avoid fetching redundant content. It then downloads each page and cleans it automatically: removing HTML tags, decoding UTF-8, transliterating characters, collapsing whitespace, and unescaping HTML entities. Finally, it uses Bloom filters for efficient, large-scale deduplication of the cleaned text, flagging documents whose n-gram overlap with previously seen content exceeds a threshold so that users can filter them out.
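The README does not document the library's internal API, so the following is a minimal Python sketch of the cleaning and Bloom-filter stages under stated assumptions: clean_page, BloomFilter, ngrams, and is_duplicate are hypothetical names, and the 5-gram size, 50% overlap threshold, and filter size are illustrative choices, not values taken from the project.

import html
import re
import unicodedata
from hashlib import blake2b

def clean_page(raw_html):
    # Strip script/style blocks, then any remaining HTML tags.
    text = re.sub(r"<(script|style).*?</\1>", " ", raw_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    # Unescape HTML entities (&amp; -> &) and crudely transliterate to ASCII.
    text = html.unescape(text)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

class BloomFilter:
    # Illustrative Bloom filter; the library's own implementation may differ.
    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several independent bit positions from salted hashes.
        for seed in range(self.num_hashes):
            digest = blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(text, n=5):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def is_duplicate(text, seen, n=5, threshold=0.5):
    # Flag documents whose n-gram overlap with earlier text exceeds the
    # threshold; otherwise register their n-grams for future comparisons.
    grams = ngrams(text, n)
    if not grams:
        return False
    overlap = sum(g in seen for g in grams) / len(grams)
    if overlap > threshold:
        return True
    for g in grams:
        seen.add(g)
    return False

# Example usage (pages is assumed to be an iterable of raw HTML strings):
# seen = BloomFilter()
# docs = [clean_page(p) for p in pages]
# corpus = [d for d in docs if not is_duplicate(d, seen)]

A Bloom filter can return false positives but never false negatives, so a check like this may occasionally discard a unique document; in exchange, membership tests over billions of n-grams fit in a fixed memory budget, which is what makes deduplication practical at web scale.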
Quick Start & Requirements
Install the dependencies, then the package itself:
pip3 install -r requirements.txt
pip3 install .
Highlighted Details
Bloom-filter deduplication flags documents by n-gram overlap, enabling efficient filtering at large scale. Cleaning covers HTML tag removal, UTF-8 decoding, character transliteration, whitespace collapsing, and HTML unescaping. URL sources include Reddit dumps and Project Gutenberg.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's setup involves manual URL acquisition for some sources. The unspecified license and the age of the last update, noted above, are further caveats for anyone considering production or commercial use.