Python package for web text extraction
Trafilatura is a Python package and command-line tool for web crawling, scraping, and extraction of main text, metadata, and comments from HTML. It is aimed at researchers and developers who need to turn raw web pages into structured output such as CSV, JSON, or Markdown.
How It Works
Trafilatura has a modular design that pairs crawling with main-content extraction. Extraction balances precision (keeping out noise such as headers and footers) against recall (keeping all valid content), using common patterns together with generic algorithms such as jusText and readability. For discovery, the tool supports sitemaps and feeds; it also handles URL management and can process inputs in parallel.
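To make the precision/recall balance concrete, here is a minimal toy sketch of the kind of text-length and link-density heuristic that generic extractors in the jusText/readability family rely on. It is not Trafilatura's actual algorithm; the names `BlockCollector` and `extract_main_text`, the sample page, and the two thresholds are all assumptions chosen for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this toy example (an assumption).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "nav", "header", "footer"}


class BlockCollector(HTMLParser):
    """Collect (text, chars_inside_links) for each block-level chunk."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Normalize whitespace and store the chunk if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></nav>
<p>Trafilatura extracts the main text of a web page while leaving out navigation, footers and other recurring elements.</p>
<p>Short.</p>
<footer><a href="/privacy">Privacy policy and terms of service for this website</a></footer>
</body></html>
"""

strict = extract_main_text(SAMPLE)               # favors precision
lenient = extract_main_text(SAMPLE, min_chars=1)  # favors recall
```

With the default thresholds only the long paragraph survives. Lowering `min_chars` also lets the short paragraph through, while the link-heavy navigation and footer are still rejected by the link-density check. Real extractors combine many more signals than these two.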
Quick Start & Requirements
pip install trafilatura
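Once installed, a first extraction might look like the sketch below. It calls `trafilatura.extract` with `output_format="json"` and `with_metadata=True` on an inline HTML string so that no network access is needed; for a live page, `trafilatura.fetch_url(url)` would supply the HTML instead. The helper name `extract_as_json` and the sample document are illustrative assumptions, and the import guard exists only so the snippet can be loaded where the package is missing.

```python
import json

try:
    import trafilatura  # provided by `pip install trafilatura`
except ImportError:
    trafilatura = None  # lets the sketch load without the package

# Hypothetical sample page; for real pages use trafilatura.fetch_url(url).
SAMPLE_HTML = """
<html><head><title>Example article</title></head><body>
<nav><a href="/">Home</a> <a href="/archive">Archive</a></nav>
<article>
<h1>Example article</h1>
<p>Web pages usually wrap their main content in navigation bars, footers,
advertising and other recurring elements that are of little use for text
analysis or corpus building.</p>
<p>An extraction tool tries to keep the paragraphs that belong to the
article itself and to discard everything else, so that the result can be
stored in a structured format and processed further.</p>
</article>
<footer>Copyright notice and site links</footer>
</body></html>
"""


def extract_as_json(html):
    """Return the extraction result as a dict, or None if nothing was extracted."""
    if trafilatura is None:
        raise RuntimeError("trafilatura is not installed")
    result = trafilatura.extract(html, output_format="json", with_metadata=True)
    # extract() returns None when it finds no usable main text.
    return json.loads(result) if result else None
```

Calling `extract_as_json(SAMPLE_HTML)` yields a dictionary holding the extracted text and metadata fields, or `None` when the input is too short or too noisy for the extractor to accept. Other values of `output_format`, such as `"csv"` or `"markdown"`, return a plain string rather than JSON.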
Highlighted Details
Maintenance & Community
Actively maintained by Adrien Barbaresi, with contributions from a community of users and developers. Sponsorship is encouraged for continued development.
Licensing & Compatibility
Distributed under the Apache 2.0 license (versions prior to v1.8.0 were under GPLv3+). Compatible with commercial use and closed-source linking.
Limitations & Caveats
The project is robust, but its continued maintenance depends on community support and sponsorship.