paper-qa by Future-House

RAG for scientific documents, providing accurate answers with citations

created 2 years ago
7,583 stars

Top 7.0% on sourcepulse

Project Summary

PaperQA2 is a Python package designed for high-accuracy Retrieval Augmented Generation (RAG) specifically tailored for scientific documents. It empowers researchers and power users to efficiently extract information, answer complex questions, and perform tasks like summarization and contradiction detection from large collections of PDFs and text files, providing grounded responses with in-text citations.

How It Works

PaperQA2 employs an agentic RAG pipeline. It begins by identifying candidate papers, potentially using LLM-generated keywords. These papers are then chunked, embedded, and added to a searchable index. For a given query, the system embeds the query, retrieves relevant document chunks, and uses an LLM to re-score and summarize these chunks contextually. Finally, the LLM generates an answer based on these curated summaries, incorporating citations. This iterative process allows for sophisticated query refinement and evidence gathering.
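The retrieve-and-rescore step at the heart of this pipeline can be sketched in a few lines. This is a self-contained illustration, not PaperQA2's implementation: toy bag-of-words vectors stand in for a real embedding model, and simple cosine ranking stands in for the LLM's contextual re-scoring.

```python
# Minimal sketch of the "embed query, retrieve relevant chunks" step.
# Toy term-frequency vectors replace a real embedding model.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, rank chunks by similarity, keep the top k as evidence."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "Carbon nanotubes are grown by chemical vapor deposition at scale.",
    "Graphene oxide is reduced chemically to produce conductive films.",
    "Arc discharge also produces carbon nanotubes in small batches.",
]
top = retrieve("How are carbon nanotubes manufactured?", chunks)
```

In the real system, the retained chunks would then be summarized by an LLM and passed forward as curated, citable evidence for answer generation.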

Quick Start & Requirements

  • Install via pip: pip install "paper-qa>=5" (quote the specifier so the shell does not interpret >=5 as a redirect)
  • Requires Python 3.11+
  • An OpenAI API key (or a key for another LiteLLM-compatible LLM provider) is recommended for the default models.
  • Optional API keys for Crossref and Semantic Scholar to avoid rate limits on metadata services.
  • Quickstart example: cd my_papers && pqa ask 'How can carbon nanotubes be manufactured at a large scale?'
  • Official Docs: https://github.com/Future-House/paper-qa#readme
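The CLI quickstart above has a Python-API equivalent built around a Settings object. The snippet below is a sketch, assuming the v5 `ask` helper, the `paper_directory` field, and the `response.session.answer` attribute path; names may differ between releases, and an OPENAI_API_KEY (or another LiteLLM-compatible provider key) must be set in the environment.

```python
# Usage sketch of the Python API — assumed field/attribute names; requires
# an OPENAI_API_KEY (or another LiteLLM-compatible provider) in the environment.
from paperqa import Settings, ask

response = ask(
    "How can carbon nanotubes be manufactured at a large scale?",
    settings=Settings(paper_directory="my_papers"),  # folder of PDFs/text files
)
print(response.session.answer)  # grounded answer text with in-text citations
```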

Highlighted Details

  • Claims "superhuman performance" on scientific tasks.
  • Integrates with Semantic Scholar, Crossref, and Unpaywall for metadata enrichment.
  • Supports agentic workflows, allowing LLMs to iteratively use tools for search and evidence gathering.
  • Highly configurable with bundled settings for various use cases (e.g., high_quality, fast, contracrow).
  • Supports local embedding models (Sentence Transformers) and locally hosted LLMs (via llama.cpp or Ollama).
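The local-model support in the last bullet is wired up through the same Settings object. This is a configuration sketch only: the field names (`llm`, `summary_llm`, `embedding`), the LiteLLM-style `ollama/` route, and the `st-` Sentence Transformers prefix are assumptions and may differ by release.

```python
# Configuration sketch for fully local operation — assumed field names and
# model strings; adjust to match your installed release.
from paperqa import Settings

local_settings = Settings(
    llm="ollama/llama3",                 # chat model served by a local Ollama instance
    summary_llm="ollama/llama3",         # model used to re-score/summarize chunks
    embedding="st-multi-qa-MiniLM-L6-cos-v1",  # local Sentence Transformers embedding
)
```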

Maintenance & Community

  • Developed by Future-House, with contributions from multiple authors cited in the papers.
  • Active development indicated by the version 5 (PaperQA2) release.
  • Links to relevant papers and citation information are provided.

Licensing & Compatibility

  • The project is released under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Docs objects pickled from prior versions (pre-v5) are incompatible and need rebuilding.
  • While it aims for high accuracy, results can differ from those reported in the published papers, because some internal tools and API access used by the authors are not available in the open-source version.
  • Users must provide their own PDFs; the tool cannot automatically access papers.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 48
  • Issues (30d): 5

Star History

  • 381 stars in the last 90 days
