paperscraper by jannisborn

Python package for scraping publication metadata and full text

Created 5 years ago

468 stars

Top 64.9% on SourcePulse

Project Summary

This Python package provides tools for scraping publication metadata and full-text files (PDF/XML) from various sources including PubMed, arXiv, medRxiv, bioRxiv, and chemRxiv. It is designed for researchers and data scientists needing to gather publication data, analyze trends, and retrieve full-text articles, offering features like citation count retrieval, journal impact factor lookups, and plotting capabilities for meta-analysis.

How It Works

The package leverages underlying libraries like arxiv, pymed, and scholarly to interact with different databases and APIs. For preprint servers (arXiv, medRxiv, bioRxiv, chemRxiv), it offers the option to download entire server dumps in .jsonl format for local, faster querying. It supports keyword-based searches across multiple sources and includes advanced PDF/XML retrieval with fallback mechanisms (BioC-PMC, eLife) and optional publisher API integration (Wiley, Elsevier) for enhanced access.
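As a rough illustration of what querying a local .jsonl dump can look like, here is a minimal sketch in plain Python. It is not paperscraper's own implementation. It assumes each line of the dump is a JSON object with `title` and `abstract` fields, and that a query is a list of keyword groups where synonyms inside a group are alternatives and all groups must match. The helper names (`matches`, `search_dump`) and the two demo records are made up for this example.

```python
import json
import tempfile
from pathlib import Path


def matches(record, query):
    """Return True if every keyword group has at least one synonym in the text."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    # Outer list: all groups must match (AND). Inner list: any synonym (OR).
    return all(any(term.lower() in text for term in group) for group in query)


def search_dump(dump_path, query):
    """Scan a .jsonl dump line by line and collect the matching records."""
    hits = []
    with open(dump_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if matches(record, query):
                hits.append(record)
    return hits


# Made-up demo records, written to a temporary file standing in for a real dump.
records = [
    {"title": "Deep learning for COVID-19 chest imaging", "abstract": "..."},
    {"title": "Catalyst design with machine learning", "abstract": "..."},
]
dump_path = Path(tempfile.mkdtemp()) / "demo_dump.jsonl"
dump_path.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")

query = [["COVID-19", "SARS-CoV-2"], ["deep learning", "machine learning"]]
hits = search_dump(dump_path, query)
print([hit["title"] for hit in hits])
```

Reading the file line by line keeps memory use flat even for dumps of several hundred MB, which is one reason a local .jsonl file suits this kind of repeated querying.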

Quick Start & Requirements

Install via pip: pip install paperscraper
For preprint server data (medRxiv, bioRxiv, chemRxiv), download the dumps first: import the downloader functions (from paperscraper.get_dumps import medrxiv, biorxiv, chemrxiv) and call the one for each server you need. This can take significant time and disk space (hundreds of MB).
Google Scholar scraping may encounter captchas, limiting large-scale use.
Full-text retrieval from PubMed may be challenging due to publisher restrictions.
Official documentation and examples are available within the README.
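The dump download step can be wrapped in a small helper, sketched below. The import path `paperscraper.get_dumps` and the three function names come from the line above. The wrapper `download_dump` itself is a hypothetical convenience, and it assumes the downloaders can be called without arguments to fetch a full dump. The import is deferred so that the name check works even where the package is not installed.

```python
def download_dump(server: str) -> None:
    """Download the metadata dump for one preprint server (hypothetical helper)."""
    supported = ("medrxiv", "biorxiv", "chemrxiv")
    if server not in supported:
        raise ValueError(f"unknown preprint server: {server!r}")

    # Deferred import: only needed once a valid server has been requested.
    from paperscraper.get_dumps import biorxiv, chemrxiv, medrxiv

    downloaders = {"medrxiv": medrxiv, "biorxiv": biorxiv, "chemrxiv": chemrxiv}
    # Assumption: calling a downloader with no arguments fetches the full dump.
    downloaders[server]()
```

Calling `download_dump("medrxiv")` starts the actual download, so expect it to run for a while and to write a sizeable file.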

Highlighted Details

Supports scraping metadata and full-text PDFs/XMLs from PubMed, arXiv, medRxiv, bioRxiv, and chemRxiv.
Integrates citation counts from Google Scholar and journal impact factors.
Includes plotting functions for Venn diagrams and comparative bar plots of search results.
Offers fallback mechanisms for PDF retrieval (BioC-PMC, eLife) and optional support for Wiley/Elsevier TDM APIs.
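The fallback idea behind full-text retrieval can be sketched generically: try each source in order and return the first payload that comes back. The code below is an illustration of that pattern only, not paperscraper's code. The function `fetch_with_fallbacks`, the three stand-in fetchers, and the example DOI are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallbacks(
    doi: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, bytes]]:
    """Try each fetcher in order; return (source name, payload) for the first success."""
    for name, fetch in fetchers:
        try:
            payload = fetch(doi)
        except Exception:
            # A failing source (paywall, timeout, ...) should not stop the chain.
            continue
        if payload:
            return name, payload
    return None  # every source failed or returned nothing


# Hypothetical stand-ins that simulate three sources without any network access.
def from_publisher(doi: str) -> Optional[bytes]:
    raise PermissionError("paywalled")


def from_bioc_pmc(doi: str) -> Optional[bytes]:
    return None  # article not available from this source


def from_elife(doi: str) -> Optional[bytes]:
    return b"<article/>"


result = fetch_with_fallbacks(
    "10.0000/example",
    [("publisher", from_publisher), ("bioc-pmc", from_bioc_pmc), ("elife", from_elife)],
)
print(result)
```

Keeping the sources in an ordered list makes the priority explicit and lets optional sources, such as a publisher API that needs a key, be added or left out without touching the loop.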

Maintenance & Community

The project has seen contributions from several individuals, with notable improvements in full-text retrieval and error handling. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Retrieving full-text articles from PubMed can be difficult due to publisher restrictions and paywalls, even with fallback mechanisms. Google Scholar scraping is prone to captchas, which hinders large-scale operations. The README notes that after downloading a date-specific dump for a preprint server, the Python interpreter might need to be restarted, and that subsequent searches will use only the newly downloaded dump.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

18 stars in the last 30 days