scrapinghub
Benchmarking article extraction quality for web content parsers
This repository provides a comprehensive benchmark for evaluating the quality of article body extraction from web pages. It offers a dataset and evaluation scripts to compare numerous open-source libraries and commercial services, enabling developers and researchers to select the most effective tools for extracting core content from articles.
How It Works
The project benchmarks article body extraction quality by comparing a wide array of open-source Python and Rust libraries, alongside commercial services like Zyte Automatic Extraction and Diffbot. It utilizes a curated dataset with ground truth and prediction files, focusing specifically on the challenging task of accurately extracting the main article text from diverse web page structures. The evaluation methodology yields quantitative metrics such as F1 score, precision, recall, and accuracy, presented with standard deviations.
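To make the metrics above concrete, here is a minimal sketch of how precision, recall, F1, and accuracy with standard deviations could be computed from a ground truth file and a prediction file. It is an illustration under stated assumptions, not the repository's evaluation script: the whitespace tokenizer, the token-overlap matching, the exact-match definition of accuracy, the bootstrap estimate of the standard deviation, and the "articleBody" key with its id-to-record layout are all hypothetical choices made for this example.

```python
import random
import statistics
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the benchmark may tokenize differently.
    return text.lower().split()


def prf(truth, pred):
    """Per-document precision, recall and F1 over token multisets."""
    t, p = Counter(tokenize(truth)), Counter(tokenize(pred))
    tp = sum((t & p).values())   # tokens present in both texts
    fp = sum(p.values()) - tp    # predicted tokens not in the ground truth
    fn = sum(t.values()) - tp    # ground truth tokens that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(ground_truth, predictions, n_boot=200, seed=0):
    """Return {metric: (mean, bootstrap standard deviation of the mean)}."""
    rows = []
    for key in sorted(ground_truth):
        # "articleBody" is a hypothetical field name for this sketch.
        truth = ground_truth[key]["articleBody"]
        pred = predictions.get(key, {}).get("articleBody", "")
        accuracy = float(tokenize(truth) == tokenize(pred))  # exact token match
        rows.append((*prf(truth, pred), accuracy))

    rng = random.Random(seed)
    result = {}
    for i, name in enumerate(("precision", "recall", "f1", "accuracy")):
        values = [row[i] for row in rows]
        # Resample documents with replacement; each metric is resampled independently.
        means = [
            statistics.mean(rng.choices(values, k=len(values)))
            for _ in range(n_boot)
        ]
        result[name] = (statistics.mean(values), statistics.stdev(means))
    return result


# Toy data standing in for a ground truth file and one tool's prediction file.
ground_truth = {
    "a": {"articleBody": "the quick brown fox"},
    "b": {"articleBody": "hello world"},
}
predictions = {
    "a": {"articleBody": "the quick brown fox jumps"},
    "b": {"articleBody": "hello world"},
}

result = evaluate(ground_truth, predictions)
for name, (mean, std) in result.items():
    print(f"{name}: {mean:.3f} ± {std:.3f}")
```

In this toy case the extra token "jumps" lowers precision for document "a" while recall stays perfect, which is the kind of trade-off the benchmark's separate precision and recall columns are meant to expose.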
Quick Start & Requirements
Dependencies are listed in requirements.txt and can be installed via make run-all within a Python virtual environment.
Whitepaper: https://www.zyte.com/whitepaper-ebook/in-depth-analysis-and-evaluation-on-the-quality-of-article-body-extraction/
Technical report (v1.0.0): https://github.com/scrapinghub/article-extraction-benchmark/releases/tag/v1.0.0
Highlighted Details
Covers Python libraries (e.g., trafilatura, newspaper4k, BeautifulSoup) and Rust crates (e.g., rs_trafilatura, readabilityrs), plus commercial services.
Maintenance & Community
The provided README does not detail specific maintenance schedules, active contributors, sponsorships, or community channels like Discord or Slack. It notes only that contributions are made via pull requests.
Licensing & Compatibility
The project is released under the MIT license, which is permissive for both open-source and commercial use. No specific compatibility restrictions are mentioned.
Limitations & Caveats
This benchmark strictly evaluates only the article body extraction field, omitting other article metadata such as headlines, authors, or publication dates. The HTML files were fetched with JavaScript rendering disabled by default.