newspaper by codelucas

Python library for news article extraction

Created 12 years ago
15,037 stars

Top 3.4% on SourcePulse

Summary

newspaper3k is a Python 3 library for scraping news sites, extracting the full text of articles, and collecting article metadata. It is aimed at developers and researchers who need to gather and process content from many news sources programmatically, for example to automate data collection and analysis.

How It Works

newspaper3k uses lxml for HTML parsing, which keeps page processing fast. Given a URL, the library downloads the article's HTML, then identifies and extracts the main article content along with metadata such as authors and publication date.
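
As a rough illustration of what separating article content from page chrome can involve, the sketch below keeps only paragraphs long enough to look like body text. It is a toy heuristic, not newspaper3k's actual algorithm, and it uses Python's standard-library html.parser instead of lxml so that it runs without extra dependencies. The names ParagraphCollector, extract_body, and min_words are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, ignoring script/style content."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text chunks of the open <p>, or None if no <p> is open
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def _flush(self):
        # Close the current paragraph, normalising whitespace.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # tolerate a previous <p> that was never closed
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None and self._skip_depth == 0:
            self._buffer.append(data)


def extract_body(html, min_words=5):
    """Return paragraphs with at least min_words words, joined by blank lines.

    Assumption: short paragraphs are navigation or widget text. Real extractors
    use far richer signals than word count.
    """
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()            # pick up a trailing unclosed <p>
    return "\n\n".join(p for p in parser.paragraphs if len(p.split()) >= min_words)
```

Calling extract_body on a page's HTML returns only the longer paragraphs, which is usually enough to drop menu items and share buttons in simple pages.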

Quick Start & Requirements

  • Primary install/run: The README demonstrates usage through Python code snippets, such as importing Article and calling article.download().
  • Prerequisites: Python 3. No minimum version is specified.
  • Links: The provided README snippet does not contain direct links to official quick-start guides, documentation, or demo pages.
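
A minimal sketch of the usage described above, assuming the package is installed from PyPI under the name newspaper3k (pip3 install newspaper3k). The helper fetch_article and the example URL are hypothetical. Article, download(), and parse() are part of the library; parse() is the step that follows download() and populates the extracted fields.

```python
def fetch_article(url):
    """Download and parse one article, returning a few extracted fields."""
    # Imported lazily so the helper can be defined even where the
    # third-party newspaper3k package is not installed.
    from newspaper import Article

    article = Article(url)
    article.download()   # fetch the raw HTML (requires network access)
    article.parse()      # extract body text and metadata from the HTML
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


# Example call (placeholder URL, substitute a real news article):
# info = fetch_article("https://example.com/news/some-article")
# print(info["title"], info["authors"])
```

Extraction quality varies by site, so it is worth checking for empty text or a missing publish_date before relying on the result.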

Highlighted Details

  • Specializes in extracting the main body text and key metadata (e.g., title, authors, publish date) from news articles.
  • Noted for its simplicity and speed, attributed to its underlying lxml parsing engine.
  • Received endorsements from notable figures like Kenneth Reitz (author of the requests library) and was featured on The Changelog.

Maintenance & Community

  • The README carries continuous integration (CI) badges from Travis CI for build status and Coveralls for code coverage. These show that automated testing is set up, but they do not by themselves indicate how actively the project is maintained.
  • The provided text does not include links to community support channels such as Discord or Slack, nor does it detail a public roadmap.

Licensing & Compatibility

  • The specific open-source license governing newspaper3k is not explicitly stated within the provided README excerpt.
  • Consequently, suitability for commercial use or for inclusion in closed-source projects cannot be determined from the excerpt alone.

Limitations & Caveats

  • The project explicitly warns of a "deprecated and buggy Python2 branch," clearly indicating that Python 3 is the sole supported and recommended environment.
  • Beyond the Python 2 deprecation, the provided text does not detail other potential limitations, such as unsupported platforms, missing features, or known bugs.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 46 stars in the last 30 days
