paperscraper by jannisborn

Python package for scraping publication metadata and full text

Created 5 years ago
413 stars

Top 70.8% on SourcePulse

Project Summary

This Python package provides tools for scraping publication metadata and full-text files (PDF/XML) from various sources including PubMed, arXiv, medRxiv, bioRxiv, and chemRxiv. It is designed for researchers and data scientists needing to gather publication data, analyze trends, and retrieve full-text articles, offering features like citation count retrieval, journal impact factor lookups, and plotting capabilities for meta-analysis.

How It Works

The package leverages underlying libraries like arxiv, pymed, and scholarly to interact with different databases and APIs. For preprint servers (arXiv, medRxiv, bioRxiv, chemRxiv), it offers the option to download entire server dumps in .jsonl format for local, faster querying. It supports keyword-based searches across multiple sources and includes advanced PDF/XML retrieval with fallback mechanisms (BioC-PMC, eLife) and optional publisher API integration (Wiley, Elsevier) for enhanced access.
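The keyword search over a locally stored dump can be pictured with a small, package-independent sketch. It assumes a query is a list of synonym groups (terms inside a group are alternatives, and every group must match) and that each .jsonl record carries title and abstract fields. The field names and the helpers matches_query and search_dump are illustrative assumptions, not paperscraper's actual implementation.

```python
import json
from pathlib import Path
from typing import Dict, List, Sequence


def matches_query(text: str, query: List[List[str]]) -> bool:
    """Return True if every synonym group has at least one term in text."""
    lowered = text.lower()
    return all(
        any(term.lower() in lowered for term in group)
        for group in query
    )


def search_dump(
    dump_path: str,
    query: List[List[str]],
    fields: Sequence[str] = ("title", "abstract"),  # assumed field names
) -> List[Dict]:
    """Scan a .jsonl file line by line and keep the records that match."""
    hits = []
    with Path(dump_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the dump
            record = json.loads(line)
            text = " ".join(str(record.get(field) or "") for field in fields)
            if matches_query(text, query):
                hits.append(record)
    return hits
```

With a query such as [["covid-19", "sars-cov-2"], ["deep learning", "neural network"]], a record is kept only if it mentions at least one term from each group. Scanning a local file this way avoids a network round trip per query, which is the motivation for downloading dumps in the first place.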

Quick Start & Requirements

  • Install via pip: pip install paperscraper
  • For preprint server data (medRxiv, bioRxiv, chemRxiv), download the dumps with the functions imported via from paperscraper.get_dumps import medrxiv, biorxiv, chemrxiv. The downloads can take significant time and disk space (hundreds of MB).
  • Google Scholar scraping may encounter captchas, limiting large-scale use.
  • Full-text retrieval from PubMed may be challenging due to publisher restrictions.
  • Official documentation and examples are available within the README.
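The dump step from the list above might look like the following sketch. The three function names come from the import shown in the list. Calling each one with no arguments to fetch a full dump is an assumption here, and the wrapper download_preprint_dumps is a hypothetical helper. The import is deferred so that nothing is loaded or downloaded until the helper is called.

```python
import importlib
from typing import Iterable, List


def download_preprint_dumps(
    servers: Iterable[str] = ("medrxiv", "biorxiv", "chemrxiv"),
) -> List[str]:
    """Download one dump per named server and return the names that ran."""
    # Deferred import: requires `pip install paperscraper`.
    get_dumps = importlib.import_module("paperscraper.get_dumps")
    finished = []
    for name in servers:
        # Assumption: each server function can be called with no arguments.
        downloader = getattr(get_dumps, name)
        downloader()
        finished.append(name)
    return finished
```

Calling download_preprint_dumps(["chemrxiv"]) would fetch only the chemRxiv dump. Given the time and disk space noted above, it is probably worth downloading only the servers a project actually needs.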

Highlighted Details

  • Supports scraping metadata and full-text PDFs/XMLs from PubMed, arXiv, medRxiv, bioRxiv, and chemRxiv.
  • Integrates citation counts from Google Scholar and journal impact factors.
  • Includes plotting functions for Venn diagrams and comparative bar plots of search results.
  • Offers fallback mechanisms for PDF retrieval (BioC-PMC, eLife) and optional support for Wiley/Elsevier TDM APIs.
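The fallback idea in the last bullet can be illustrated with a generic, package-independent sketch: try each retrieval source in order and stop at the first one that returns content. The fetcher callables and the name fetch_with_fallbacks are hypothetical. This shows only the general pattern, not how paperscraper implements its BioC-PMC or eLife fallbacks.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a DOI and returns file bytes, or None if it has nothing.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallbacks(
    doi: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each (name, fetcher) in order; return the first non-empty result."""
    for name, fetch in fetchers:
        try:
            content = fetch(doi)
        except Exception:
            # A failing source should not abort the chain; try the next one.
            continue
        if content:
            return name, content
    # Every source failed or returned nothing.
    return None, None
```

Returning the source name alongside the content makes it easy to log which route succeeded, which helps when paywalls block the primary source for many articles.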

Maintenance & Community

The project has seen contributions from several individuals, with notable improvements in full-text retrieval and error handling. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Retrieving full-text articles from PubMed can be difficult due to publisher restrictions and paywalls, even with fallback mechanisms. Google Scholar scraping is prone to captchas, hindering large-scale operations. The README notes that using date-specific dump downloads for preprint servers might require restarting the Python interpreter and that subsequent searches will only use the newly downloaded dump.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
