trafilatura by adbar

Python package for web text extraction

Created 6 years ago
4,679 stars

Top 10.6% on SourcePulse

Project Summary

Trafilatura is a Python package and command-line tool for web crawling, scraping, and extracting main text, metadata, and comments from HTML. It targets researchers and developers who need to turn raw web data into structured formats such as CSV, JSON, or Markdown.

How It Works

Trafilatura has a modular design that combines crawling with text extraction. Extraction balances precision (limiting noise such as headers and footers) against recall (including all valid content), with the generic algorithms jusText and readability serving as fallbacks. For discovery, the tool supports sitemaps and feeds; it also handles URL management and can process pages in parallel.
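
The trade-off between precision and recall can be made concrete with a small token-overlap calculation. The sketch below is illustrative only: the function name `token_precision_recall` and the sample strings are hypothetical, and this is not how Trafilatura or the benchmarks cited on this page compute their scores.

```python
from collections import Counter


def token_precision_recall(extracted: str, reference: str) -> tuple[float, float]:
    """Compare an extraction against a hand-made reference, token by token."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0  # share of output that is valid
    recall = overlap / sum(ref.values()) if ref else 0.0     # share of valid content kept
    return precision, recall


reference = "the quick brown fox jumps"

# Keeps all content but also menu items: full recall, lower precision.
noisy = "home about the quick brown fox jumps contact"

# Drops part of the content but adds no noise: full precision, lower recall.
partial = "the quick brown"

print(token_precision_recall(noisy, reference))    # (0.625, 1.0)
print(token_precision_recall(partial, reference))  # (1.0, 0.6)
```

An extractor tuned only for precision tends toward the second case, and one tuned only for recall toward the first; the aim is to score well on both.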

Quick Start & Requirements

Highlighted Details

  • Outperforms other open-source libraries in text extraction benchmarks, including ScrapingHub's article extraction benchmark and an empirical comparison by Bevendorff et al. (2023).
  • Extracts main text, metadata (title, author, date, etc.), comments, links, images, and tables.
  • Supports multiple output formats: TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI.
  • Includes optional add-ons for language detection and speed optimizations.
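
As a rough sketch of how the formats listed above might be requested in practice, the snippet below maps the display names to the `output_format` argument of `trafilatura.extract`. The helper names `normalize_format` and `extract_as` are hypothetical, and the set of accepted values is an assumption based on recent releases; check the documentation for the installed version.

```python
# Values assumed to be accepted by trafilatura.extract(output_format=...)
# in recent releases; older versions may support fewer formats.
SUPPORTED_FORMATS = {"txt", "markdown", "csv", "json", "html", "xml", "xmltei"}


def normalize_format(label: str) -> str:
    """Turn a display name such as "XML-TEI" into an output_format value."""
    value = label.strip().lower().replace("-", "")
    if value not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {label!r}")
    return value


def extract_as(html: str, label: str):
    """Extract main text plus metadata from an HTML string in the given format.

    Returns a string, or None when no main content is found.
    """
    # Imported here so normalize_format() stays usable without the package
    # installed (pip install trafilatura).
    import trafilatura

    return trafilatura.extract(
        html,
        output_format=normalize_format(label),
        with_metadata=True,      # include title, author, date, and similar fields
        include_comments=False,  # leave out the comment section
    )
```

For a page downloaded with `trafilatura.fetch_url(url)`, a call such as `extract_as(downloaded, "Markdown")` would return the main text as Markdown, or `None` if nothing could be extracted.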

Maintenance & Community

Actively maintained by Adrien Barbaresi, with contributions from a community of users and developers. Sponsorship is encouraged for continued development.

Licensing & Compatibility

Distributed under the Apache 2.0 license (versions prior to v1.8.0 were under GPLv3+). Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is robust, but its continued maintenance depends on community support and sponsorships.

Health Check

Last Commit: 6 days ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 0

Star History

98 stars in the last 30 days
