article-extraction-benchmark by scrapinghub

Benchmarking article extraction quality for web content parsers

Created 6 years ago

376 stars

Top 75.3% on SourcePulse

Project Summary

This repository provides a benchmark for evaluating the quality of article body extraction from web pages. It offers a dataset and evaluation scripts for comparing open-source libraries and commercial services, so developers and researchers can choose an extractor based on measured quality.

How It Works

The project benchmarks article body extraction quality across open-source Python and Rust libraries and commercial services such as Zyte Automatic Extraction and Diffbot. It uses a curated dataset of ground truth and prediction files, and focuses on one task: extracting the main article text accurately from diverse web page structures. The evaluation reports F1 score, precision, recall, and accuracy, each with a standard deviation.
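To make the metrics above concrete, here is a minimal sketch of precision, recall, and F1 computed from the overlap between a ground-truth body and an extracted body. It is an illustration only: the whitespace tokenizer, the token-multiset overlap, and the function names `tokenize` and `overlap_scores` are assumptions made for this example, not the benchmark's own evaluation code, which may compare texts differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def overlap_scores(truth, prediction):
    """Return (precision, recall, f1) from token-multiset overlap.

    Hypothetical helper for illustration; not taken from the benchmark.
    """
    truth_counts = Counter(tokenize(truth))
    pred_counts = Counter(tokenize(prediction))
    # Multiset intersection: a token counts only as often as it
    # appears in both texts.
    true_positives = sum((truth_counts & pred_counts).values())
    pred_total = sum(pred_counts.values())
    truth_total = sum(truth_counts.values())
    precision = true_positives / pred_total if pred_total else 0.0
    recall = true_positives / truth_total if truth_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The extractor kept a navigation word ("menu") and dropped the last word.
truth = "the quick brown fox jumps"
prediction = "menu the quick brown fox"
precision, recall, f1 = overlap_scores(truth, prediction)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case, leftover boilerplate lowers precision and missing article text lowers recall, which is why a benchmark of this kind reports both alongside F1.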

Quick Start & Requirements

Primary install / run command: Clone the repository. The evaluation scripts require Python 3.6+. Dependencies for regenerating the output files are listed in requirements.txt; according to this summary, they are set up via make run-all inside a Python virtual environment.
Non-default prerequisites: Python 3.6+.
Links:
Whitepaper: https://www.zyte.com/whitepaper-ebook/in-depth-analysis-and-evaluation-on-the-quality-of-article-body-extraction/
Technical report (v1.0.0): https://github.com/scrapinghub/article-extraction-benchmark/releases/tag/v1.0.0

Highlighted Details

Evaluates dozens of Python libraries (e.g., trafilatura, newspaper4k, BeautifulSoup) and Rust crates (e.g., rs_trafilatura, readabilityrs), plus commercial services.
Provides detailed performance metrics (F1, precision, recall, accuracy) with standard deviations across multiple evaluation rounds.
Includes a JSON dataset with article bodies, URLs, and ground truth, alongside raw HTML files.
Focuses exclusively on the article body extraction field, a critical component of content parsing.
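The dataset and metric bullets above can be sketched as follows: pair each ground-truth item with an extractor's prediction, score each pair, and estimate the spread of the mean score. Everything specific here is an assumption for illustration: the JSON shape (an object keyed by item id with an `articleBody` field), the helper names `pair_items` and `bootstrap_std`, the exact-match score, and the use of bootstrap resampling to obtain a standard deviation. The repository's real file layout and procedure should be checked before relying on any of it.

```python
import json
import random
import statistics


def pair_items(ground_truth, predictions):
    """Yield (item_id, true_body, predicted_body) for each ground-truth item.

    A missing prediction is treated as an empty string.
    """
    for item_id, item in sorted(ground_truth.items()):
        predicted = predictions.get(item_id, {}).get("articleBody", "")
        yield item_id, item["articleBody"], predicted


def bootstrap_std(scores, n_rounds=1000, seed=0):
    """Standard deviation of the mean score over resampled rounds."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rounds):
        # Resample the per-item scores with replacement.
        sample = rng.choices(scores, k=len(scores))
        means.append(statistics.mean(sample))
    return statistics.pstdev(means)


# Hypothetical, hand-written stand-ins for a ground-truth file and one
# extractor's output file.
ground_truth = json.loads("""
{
  "a1": {"articleBody": "Hello world", "url": "https://example.com/a1"},
  "a2": {"articleBody": "Second article", "url": "https://example.com/a2"}
}
""")
predictions = json.loads('{"a1": {"articleBody": "Hello world"}}')

pairs = list(pair_items(ground_truth, predictions))
# Exact match is used purely as a stand-in per-item score.
exact = [1.0 if truth == pred else 0.0 for _, truth, pred in pairs]
spread = bootstrap_std(exact, n_rounds=200)
print(statistics.mean(exact), spread)
```

Treating a missing prediction as an empty string keeps every ground-truth item in the score, so an extractor that fails on a page is penalized rather than silently skipped.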

Maintenance & Community

The README does not describe a maintenance schedule, active contributors, sponsorships, or community channels such as Discord or Slack. Contributions are made via pull requests.

Licensing & Compatibility

The project is released under the MIT license, which is permissive for both open-source and commercial use. No specific compatibility restrictions are mentioned.

Limitations & Caveats

This benchmark evaluates only the article body extraction field and omits other article metadata such as headlines, authors, and publication dates. The HTML files were fetched with JavaScript rendering disabled by default.

