article-extraction-benchmark  by scrapinghub

Benchmarking article extraction quality for web content parsers

Created 6 years ago
362 stars

Top 77.6% on SourcePulse

Project Summary

This repository provides a comprehensive benchmark for evaluating the quality of article body extraction from web pages. It offers a dataset and evaluation scripts to compare numerous open-source libraries and commercial services, enabling developers and researchers to select the most effective tools for extracting core content from articles.

How It Works

The project benchmarks article body extraction quality by comparing a wide array of open-source Python and Rust libraries, alongside commercial services like Zyte Automatic Extraction and Diffbot. It utilizes a curated dataset with ground truth and prediction files, focusing specifically on the challenging task of accurately extracting the main article text from diverse web page structures. The evaluation methodology yields quantitative metrics such as F1 score, precision, recall, and accuracy, presented with standard deviations.
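As an illustration of how such metrics can be derived for one page, here is a minimal sketch that compares a predicted article body against its ground truth. It assumes lowercase whitespace tokenization and bag-of-words overlap, and it treats accuracy as an exact match of the token sequences. Both choices are assumptions made for this example; the benchmark's own tokenization and matching rules may differ, and the function names `tokenize`, `precision_recall_f1`, and `exact_match` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the benchmark may tokenize differently.
    return text.lower().split()


def precision_recall_f1(truth, prediction):
    """Bag-of-words precision, recall, and F1 for one document."""
    true_counts = Counter(tokenize(truth))
    pred_counts = Counter(tokenize(prediction))
    n_true = sum(true_counts.values())
    n_pred = sum(pred_counts.values())
    if n_true == 0 and n_pred == 0:
        # Both texts are empty: count this as a perfect match.
        return 1.0, 1.0, 1.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((true_counts & pred_counts).values())
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact_match(truth, prediction):
    # Assumption: "accuracy" is the share of documents whose tokens match exactly.
    return tokenize(truth) == tokenize(prediction)


if __name__ == "__main__":
    p, r, f1 = precision_recall_f1(
        "the quick brown fox", "the quick red fox jumps"
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Averaging the per-document values over the whole dataset gives corpus-level scores; accuracy would then be the fraction of documents for which `exact_match` returns `True`.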

Quick Start & Requirements

  • Primary install / run command: Clone the repository and run the evaluation scripts with Python 3.6+. Dependencies needed to re-generate the output files are listed in requirements.txt and are installed via make run-all inside a Python virtual environment.
  • Non-default prerequisites: Python 3.6+.
  • Links: Whitepaper: https://www.zyte.com/whitepaper-ebook/in-depth-analysis-and-evaluation-on-the-quality-of-article-body-extraction/, Technical report (v1.0.0): https://github.com/scrapinghub/article-extraction-benchmark/releases/tag/v1.0.0.

Highlighted Details

  • Evaluates dozens of Python libraries (e.g., trafilatura, newspaper4k, BeautifulSoup) and Rust crates (e.g., rs_trafilatura, readabilityrs), plus commercial services.
  • Provides detailed performance metrics (F1, precision, recall, accuracy) with standard deviations across multiple evaluation rounds.
  • Includes a JSON dataset with article bodies, URLs, and ground truth, alongside raw HTML files.
  • Focuses exclusively on the article body extraction field, a critical component of content parsing.
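To make the dataset and reporting points above more concrete, the following sketch pairs ground-truth and predicted article bodies by item ID and summarizes a list of per-round scores as mean and standard deviation. The JSON layout used here (an object keyed by item ID whose values hold `articleBody` and `url`) is an assumption for illustration, as are the function names `load_pairs` and `mean_and_std`; check the repository's files for the actual structure.

```python
import json
import statistics


def load_pairs(ground_truth_json, prediction_json, field="articleBody"):
    """Pair ground-truth and predicted texts by item ID.

    Assumed layout (hypothetical): {"<item id>": {"articleBody": "...", "url": "..."}}
    """
    truth = json.loads(ground_truth_json)
    predictions = json.loads(prediction_json)
    pairs = {}
    for item_id, item in truth.items():
        # A missing or null prediction is treated as an empty extraction.
        predicted = predictions.get(item_id, {}).get(field) or ""
        pairs[item_id] = (item.get(field, ""), predicted)
    return pairs


def mean_and_std(scores):
    """Mean and sample standard deviation of scores from several rounds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    ground_truth = json.dumps({
        "a1": {"articleBody": "First article text.", "url": "https://example.com/1"},
        "a2": {"articleBody": "Second article text.", "url": "https://example.com/2"},
    })
    predicted = json.dumps({"a1": {"articleBody": "First article text."}})
    for item_id, (true_text, pred_text) in load_pairs(ground_truth, predicted).items():
        print(item_id, repr(true_text), repr(pred_text))
    print(mean_and_std([0.8, 0.9, 1.0]))
```

Keeping every ground-truth item in the result, even when a tool produced no output for it, means a missing extraction is scored as a failure rather than silently skipped.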

Maintenance & Community

The README does not describe maintenance schedules, active contributors, sponsorships, or community channels such as Discord or Slack. Contributions are accepted via pull requests.

Licensing & Compatibility

The project is released under the MIT license, which is permissive for both open-source and commercial use. No specific compatibility restrictions are mentioned.

Limitations & Caveats

This benchmark evaluates only the article body extraction field and omits other article metadata such as headlines, authors, and publication dates. The HTML files were fetched with JavaScript rendering disabled by default, so content that pages load via JavaScript may be absent from the dataset.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
