Python package for scraping publication metadata and full text
This Python package provides tools for scraping publication metadata and full-text files (PDF/XML) from various sources including PubMed, arXiv, medRxiv, bioRxiv, and chemRxiv. It is designed for researchers and data scientists needing to gather publication data, analyze trends, and retrieve full-text articles, offering features like citation count retrieval, journal impact factor lookups, and plotting capabilities for meta-analysis.
How It Works
The package leverages underlying libraries like arxiv, pymed, and scholarly to interact with different databases and APIs. For preprint servers (arXiv, medRxiv, bioRxiv, chemRxiv), it offers the option to download entire server dumps in .jsonl format for local, faster querying. It supports keyword-based searches across multiple sources and includes advanced PDF/XML retrieval with fallback mechanisms (BioC-PMC, eLife) and optional publisher API integration (Wiley, Elsevier) for enhanced access.
Quick Start & Requirements
pip install paperscraper
from paperscraper.get_dumps import medrxiv, biorxiv, chemrxiv
Downloading the preprint server dumps can take significant time and disk space (hundreds of MB).
Highlighted Details
Maintenance & Community
The project has seen contributions from several individuals, with notable improvements in full-text retrieval and error handling. Links to community channels are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Retrieving full-text articles from PubMed can be difficult due to publisher restrictions and paywalls, even with fallback mechanisms. Google Scholar scraping is prone to captchas, hindering large-scale operations. The README notes that using date-specific dump downloads for preprint servers might require restarting the Python interpreter and that subsequent searches will only use the newly downloaded dump.
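Since full-text retrieval can fail at any single source, a caller may want to try sources in order and keep a record of what went wrong. The following is a minimal sketch of such a fallback chain, not the package's actual implementation. The name fetch_with_fallbacks and the fetcher callables are hypothetical; each fetcher is assumed to take a DOI and return the file bytes, return None, or raise.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A fetcher takes a DOI and returns file bytes, or None if nothing was found.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallbacks(
    doi: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes], List[Tuple[str, str]]]:
    """Try each (name, fetcher) pair in order.

    Returns (source_name, content, errors). source_name and content are None
    when every source failed; errors lists (name, reason) for each failure.
    """
    errors: List[Tuple[str, str]] = []
    for name, fetch in fetchers:
        try:
            content = fetch(doi)
        except Exception as exc:  # one failing source should not abort the chain
            errors.append((name, repr(exc)))
            continue
        if content:
            return name, content, errors
        errors.append((name, "no content returned"))
    return None, None, errors
```

Returning the collected errors rather than raising lets a batch job log which DOIs were blocked by paywalls and move on to the next paper.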