trafilatura by adbar

Python package for web text extraction

Created 6 years ago

5,334 stars

Top 9.3% on SourcePulse

Project Summary

Trafilatura is a Python package and command-line tool for web crawling, scraping, and extracting main text, metadata, and comments from HTML. It is aimed at researchers and developers who need to turn raw web data into structured formats such as CSV, JSON, or Markdown.

How It Works

Trafilatura has a modular design that combines crawling with content extraction. The extractor balances precision (limiting noise from headers and footers) against recall (including all valid content), and draws on generic algorithms similar to jusText and readability. It supports sitemaps and feeds for link discovery, handles URL management, and offers parallel processing.
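The precision/recall trade-off can be made concrete with a small token-overlap calculation. This is a minimal sketch for illustration only, not Trafilatura's own evaluation code: the function name `precision_recall` and the sample strings are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def precision_recall(extracted: str, reference: str) -> tuple[float, float]:
    """Token-level precision and recall of an extraction against a reference text."""
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Hypothetical data: the "true" main text of a page and two imperfect extractions.
reference = "the quick brown fox jumps over the lazy dog"
noisy = "home about contact the quick brown fox jumps over the lazy dog"
truncated = "the quick brown fox"

print(precision_recall(noisy, reference))      # navigation noise lowers precision
print(precision_recall(truncated, reference))  # missing content lowers recall
```

An extractor that keeps navigation text scores full recall but lower precision, while one that cuts the article short scores full precision but lower recall.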

Quick Start & Requirements

Install via pip: pip install trafilatura
Requirements: Python 3.x. No specific hardware or GPU is mandated.
Documentation: https://trafilatura.readthedocs.io/
Usage examples: https://github.com/adbar/trafilatura#usage
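
After installation, basic use from Python might look like the sketch below. It relies on `trafilatura.extract` with the `output_format` and `with_metadata` parameters; the sample HTML and the helper name `extract_as_json` are hypothetical and exist only for this illustration. The installation check is there so that the script gives a readable message when the package is missing.

```python
import importlib.util

# Hypothetical sample page, used here only for illustration.
SAMPLE_HTML = """<html><head><title>Sample article</title></head><body>
<nav>Home | About | Contact</nav>
<article>
<h1>Sample article</h1>
<p>Trafilatura aims to keep the main text of a page and drop the boilerplate
around it. This paragraph stands in for the body of an article.</p>
<p>A second paragraph gives the extractor a little more content to work with,
since very short pages may yield no result.</p>
</article>
<footer>Copyright notice</footer>
</body></html>"""


def extract_as_json(html):
    """Return the extracted content as a JSON string, or None if nothing was found."""
    # Imported lazily so the availability check below can run first.
    import trafilatura
    return trafilatura.extract(html, output_format="json", with_metadata=True)


if importlib.util.find_spec("trafilatura") is None:
    print("trafilatura is not installed; run: pip install trafilatura")
else:
    # To process a live page instead, download it first:
    #   html = trafilatura.fetch_url(url)
    print(extract_as_json(SAMPLE_HTML))
```

For a live page, `trafilatura.fetch_url` downloads the HTML, which can then be passed to the same helper.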

Highlighted Details

Outperforms other open-source libraries in text extraction benchmarks, including ScrapingHub's article extraction benchmark and an empirical comparison by Bevendorff et al. (2023).
Extracts main text, metadata (title, author, date, etc.), comments, links, images, and tables.
Supports multiple output formats: TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI.
Includes optional add-ons for language detection and speed optimizations.

Maintenance & Community

Actively maintained by Adrien Barbaresi, with contributions from a community of users and developers. Sponsorship is encouraged for continued development.

Licensing & Compatibility

Distributed under the Apache 2.0 license (versions prior to v1.8.0 were under GPLv3+). Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project's long-term maintenance depends on community support and sponsorship.

Health Check

Last commit: 5 months ago
Responsiveness: 1 day
