Web scraping library for creating massive datasets
This library provides tools for crawling, cleaning, and deduplicating web pages to construct large monolingual text datasets, aimed at researchers and developers building large language models. By automating the acquisition and cleaning of web content, it enables the creation of datasets potentially larger than OpenAI's WebText.
How It Works
The library runs a multi-stage pipeline. It first acquires URLs from sources such as Reddit dumps and Project Gutenberg, deduplicating them to avoid fetching redundant content. It then downloads each page and cleans it automatically: removing HTML tags, decoding UTF-8, transliterating characters, collapsing whitespace, and unescaping HTML entities. Finally, it uses Bloom filters for efficient, large-scale deduplication of the cleaned text, flagging documents whose n-gram overlap with previously seen content exceeds a threshold so that users can filter them out.
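The README does not document the library's internal API, so the following is a minimal Python sketch of the cleaning and Bloom-filter stages under stated assumptions: clean_page, BloomFilter, ngrams, and is_duplicate are hypothetical names, and the 5-gram size, 50% overlap threshold, and filter size are illustrative choices, not values taken from the project.

import html
import re
import unicodedata
from hashlib import blake2b

def clean_page(raw_html):
    # Strip script/style blocks, then any remaining HTML tags.
    text = re.sub(r"<(script|style).*?</\1>", " ", raw_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    # Unescape HTML entities (&amp; -> &) and crudely transliterate to ASCII.
    text = html.unescape(text)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

class BloomFilter:
    # Illustrative Bloom filter; the library's own implementation may differ.
    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several independent bit positions from salted hashes.
        for seed in range(self.num_hashes):
            digest = blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(text, n=5):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def is_duplicate(text, seen, n=5, threshold=0.5):
    # Flag documents whose n-gram overlap with earlier text exceeds the
    # threshold; otherwise register their n-grams for future comparisons.
    grams = ngrams(text, n)
    if not grams:
        return False
    overlap = sum(g in seen for g in grams) / len(grams)
    if overlap > threshold:
        return True
    for g in grams:
        seen.add(g)
    return False

# Example usage (pages is assumed to be an iterable of raw HTML strings):
# seen = BloomFilter()
# docs = [clean_page(p) for p in pages]
# corpus = [d for d in docs if not is_duplicate(d, seen)]

A Bloom filter can return false positives but never false negatives, so a check like this may occasionally discard a unique document; in exchange, membership tests over billions of n-grams fit in a fixed memory budget, which is what makes deduplication practical at web scale.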
Quick Start & Requirements
Install the dependencies, then the package itself:
pip3 install -r requirements.txt
pip3 install .
Highlighted Details
Bloom-filter deduplication flags documents by n-gram overlap, enabling efficient filtering at large scale. Cleaning covers HTML tag removal, UTF-8 decoding, character transliteration, whitespace collapsing, and HTML unescaping. URL sources include Reddit dumps and Project Gutenberg.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's setup involves manual URL acquisition for some sources. The unspecified license and the age of the last update, noted above, are further caveats for anyone considering production or commercial use.