lazynlp by chiphuyen

Web scraping library for creating massive datasets

Created 6 years ago
2,207 stars

Top 20.5% on SourcePulse

Project Summary

This library provides tools for crawling, cleaning, and deduplicating web pages to construct large, monolingual text datasets, targeting researchers and developers building large language models. It enables the creation of datasets potentially larger than OpenAI's WebText by automating the acquisition and purification of web content.

How It Works

The library works in stages. First, URLs are collected from sources such as Reddit dumps and Gutenberg books, then deduplicated so the same page is not fetched twice. Next, it downloads the webpages and cleans them automatically: removing HTML tags, decoding UTF-8, transliterating characters, collapsing whitespace, and unescaping HTML entities. Finally, it uses Bloom filters for efficient, large-scale deduplication of the cleaned text based on n-gram overlap, so users can filter out documents that overlap significantly with ones already seen.
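The cleaning stage described above can be illustrated with a minimal sketch that uses only the Python standard library. This is not lazynlp's own code: the function name `clean_page`, the regular expressions, the order of the steps, and the choice to transliterate by dropping non-ASCII characters after NFKD normalization are all assumptions made for illustration.

```python
import html
import re
import unicodedata

# Hypothetical helper patterns, not taken from lazynlp.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_page(raw: bytes) -> str:
    """Turn raw page bytes into plain ASCII text (illustrative only)."""
    # Decode as UTF-8, silently skipping undecodable bytes.
    text = raw.decode("utf-8", errors="ignore")
    # Drop script/style blocks entirely, then strip remaining tags.
    text = SCRIPT_STYLE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Unescape HTML entities such as &amp; or &eacute;.
    text = html.unescape(text)
    # Crude transliteration: decompose characters, keep the ASCII part.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace into single spaces.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    page = "<p>Caf&eacute;  au&nbsp;lait</p><script>x=1</script>"
    print(clean_page(page.encode("utf-8")))
```

A regex-based tag stripper is enough for a sketch, but it is fragile on malformed markup; a real pipeline would more likely rely on an HTML parser.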

Quick Start & Requirements

Highlighted Details

  • Enables creation of datasets exceeding 50GB of pure text.
  • Processes 1GB of text in approximately 3 hours when running 30 scripts in parallel.
  • Includes built-in exclusion lists for scraper-unfriendly domains and extensions.
  • Offers configurable n-gram granularity ('char' or 'word') for deduplication.
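To make the n-gram overlap idea concrete, here is a small self-contained sketch of Bloom-filter deduplication with 'char' or 'word' granularity. It is an assumption-laden illustration, not lazynlp's implementation: the `BloomFilter` class, `ngrams`, `overlap_ratio`, `dedup`, the filter size, the hash scheme, and the 0.5 threshold are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter using double hashing (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n, gran="word"):
    """Return n-grams over words or characters."""
    units = text.split() if gran == "word" else list(text)
    sep = " " if gran == "word" else ""
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]


def overlap_ratio(text, seen, n=3, gran="word"):
    """Fraction of the text's n-grams already present in the filter."""
    grams = ngrams(text, n, gran)
    if not grams:
        return 0.0
    return sum(g in seen for g in grams) / len(grams)


def dedup(docs, threshold=0.5, n=3, gran="word"):
    """Keep documents whose overlap with earlier kept ones is low."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        if overlap_ratio(doc, seen, n, gran) <= threshold:
            kept.append(doc)
            for g in ngrams(doc, n, gran):
                seen.add(g)
    return kept
```

Because a Bloom filter can report false positives but never false negatives, the overlap estimate can only err on the high side; sizing the filter generously keeps that error small. Character n-grams catch near-duplicates that differ in tokenization, while word n-grams are cheaper for long documents.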

Maintenance & Community

  • The project appears to be a personal project by Chip Huyen, with no explicit mention of active maintenance or community channels.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

Setup requires acquiring URLs manually for some sources. The README does not specify a license, which may complicate commercial use or integration into closed-source projects. It also does not mention ongoing maintenance or community support.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 16 stars in the last 30 days
