lazynlp by chiphuyen

Web scraping library for creating massive datasets

created 6 years ago
2,190 stars

Top 21.1% on sourcepulse

Project Summary

This library provides tools for crawling, cleaning, and deduplicating web pages to build large monolingual text datasets. It targets researchers and developers training large language models. By automating the download and cleaning of web content, it makes it possible to assemble datasets potentially larger than OpenAI's WebText.

How It Works

The library works in stages. First, URLs are acquired from sources such as Reddit dumps and Gutenberg books, and the URL list is deduplicated so the same page is not fetched twice. Next, the library downloads each webpage and cleans it automatically: it removes HTML tags, decodes the content as UTF-8, transliterates characters, collapses whitespace, and unescapes HTML entities. Finally, it deduplicates the cleaned text at scale with Bloom filters, measuring n-gram overlap between documents so that users can filter out any document that overlaps significantly with text already seen.
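The cleaning stage described above can be sketched with the Python standard library alone. This is a minimal illustration, not lazynlp's own code: the function name `clean_page` is hypothetical, the regex-based tag stripping is a simplification, and NFKD normalisation stands in for whatever transliteration the library actually performs.

```python
import html
import re
import unicodedata


def clean_page(raw: bytes) -> str:
    """Hypothetical helper: turn raw page bytes into plain ASCII text."""
    # Decode as UTF-8, silently dropping bytes that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # Drop script and style elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", text)
    # Remove the remaining HTML tags (a rough approximation of tag stripping).
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Unescape HTML entities such as &amp; and &eacute;.
    text = html.unescape(text)
    # Approximate transliteration: split accented characters into base
    # letter plus combining mark, then discard everything non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_page(b"<p>Caf\xc3\xa9 &amp;  tea</p>"))
```

A real pipeline would likely use a proper HTML parser instead of regular expressions; the regexes here only keep the sketch short.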

Quick Start & Requirements

Highlighted Details

  • Enables creation of datasets exceeding 50GB of pure text.
  • Processes 1GB of text in approximately 3 hours when running 30 scripts in parallel.
  • Includes built-in exclusion lists for scraper-unfriendly domains and extensions.
  • Offers configurable n-gram granularity ('char' or 'word') for deduplication.
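To make the Bloom-filter deduplication and the 'char' versus 'word' granularity concrete, here is a small self-contained sketch. All names (`BloomFilter`, `ngrams`, `dedup`), the filter size, the n-gram length, and the 0.5 overlap threshold are assumptions chosen for illustration; they are not taken from lazynlp's API.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(text: str, n: int, gran: str = "word") -> list:
    """Split text into n-grams of words or of characters."""
    units = text.split() if gran == "word" else list(text)
    sep = " " if gran == "word" else ""
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]


def dedup(docs, n: int = 3, gran: str = "word", threshold: float = 0.5):
    """Keep a document only if too few of its n-grams were seen before."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        grams = ngrams(doc, n, gran)
        if grams:
            overlap = sum(g in seen for g in grams) / len(grams)
            if overlap >= threshold:
                continue  # significantly overlaps earlier documents
        kept.append(doc)
        for g in grams:
            seen.add(g)
    return kept


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumps over the lazy dog tonight",
        "completely different sentence about bloom filters and hashing tricks",
    ]
    for doc in dedup(docs):
        print(doc)
```

Because a Bloom filter can report false positives but never false negatives, the overlap estimate can only err on the high side; a larger bit array lowers that error at the cost of memory.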

Maintenance & Community

  • The project appears to be a personal project by Chip Huyen, with no explicit mention of active maintenance or community channels.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

Setup involves manual URL acquisition for some sources. Because the README does not specify a license, commercial use or integration into closed-source projects may be restricted. There is no explicit mention of ongoing maintenance or community support.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 9 stars in the last 90 days
