trafilatura by adbar

Python package for web text extraction

created 6 years ago
4,523 stars

Top 11.0% on sourcepulse

Project Summary

Trafilatura is a Python package and command-line tool for web crawling, scraping, and extracting main text, metadata, and comments from HTML. It targets researchers and developers who need to turn raw web data into structured formats such as CSV, JSON, or Markdown.

How It Works

Trafilatura has a modular design that combines crawling with main-content extraction. The extractor balances precision (limiting noise such as headers and footers) against recall (including all valid content), and it falls back on generic algorithms such as jusText and readability. For discovery the tool supports sitemaps and feeds, and it also handles URL management and parallel processing.
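
The precision/recall trade-off mentioned above can be made concrete with a small token-level calculation. This is only an illustrative sketch using the standard library: the function name and the sample strings are hypothetical, and it is not the scoring code used by Trafilatura or by the benchmarks cited on this page.

```python
from collections import Counter


def token_precision_recall(extracted: str, reference: str) -> tuple[float, float]:
    """Token-level precision and recall of extracted text against a reference."""
    extracted_tokens = Counter(extracted.lower().split())
    reference_tokens = Counter(reference.lower().split())
    # Multiset intersection: tokens found in both texts, counted with multiplicity.
    overlap = sum((extracted_tokens & reference_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_reference = sum(reference_tokens.values())
    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    return precision, recall


# Hypothetical reference text and two imperfect extractions of it.
reference = "the main article text"
noisy = "home about the main article text"  # keeps navigation noise
truncated = "the main"                      # drops valid content

noisy_scores = token_precision_recall(noisy, reference)
truncated_scores = token_precision_recall(truncated, reference)
print(noisy_scores)      # full recall, reduced precision
print(truncated_scores)  # full precision, reduced recall
```

An extractor that keeps boilerplate loses precision, while one that cuts too aggressively loses recall; the two sample strings show one case of each.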

Quick Start & Requirements

Highlighted Details

  • Outperforms other open-source libraries in text extraction benchmarks, including ScrapingHub's article extraction benchmark and an empirical comparison by Bevendorff et al. (2023).
  • Extracts main text, metadata (title, author, date, etc.), comments, links, images, and tables.
  • Supports multiple output formats: TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI.
  • Includes optional add-ons for language detection and speed optimizations.

Maintenance & Community

Actively maintained by Adrien Barbaresi, with contributions from a community of users and developers. Sponsorship is encouraged for continued development.

Licensing & Compatibility

Distributed under the Apache 2.0 license (versions prior to v1.8.0 were under GPLv3+). Compatible with commercial use and closed-source linking.

Limitations & Caveats

Although the project is robust, its future maintenance depends on community support and sponsorship.

Health Check
Last commit: 2 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 2
Star History
350 stars in the last 90 days
