paper-qa by Future-House

RAG for scientific documents, providing accurate answers with citations

created 2 years ago
7,583 stars

Top 7.0% on sourcepulse

Project Summary

PaperQA2 is a Python package designed for high-accuracy Retrieval Augmented Generation (RAG) specifically tailored for scientific documents. It empowers researchers and power users to efficiently extract information, answer complex questions, and perform tasks like summarization and contradiction detection from large collections of PDFs and text files, providing grounded responses with in-text citations.

How It Works

PaperQA2 employs an agentic RAG pipeline. It begins by identifying candidate papers, potentially using LLM-generated keywords. These papers are then chunked, embedded, and added to a searchable index. For a given query, the system embeds the query, retrieves relevant document chunks, and uses an LLM to re-score and summarize these chunks contextually. Finally, the LLM generates an answer based on these curated summaries, incorporating citations. This iterative process allows for sophisticated query refinement and evidence gathering.
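The retrieve-and-rescore step at the heart of this pipeline can be sketched in a few lines. This is a self-contained illustration, not PaperQA2's implementation: toy bag-of-words vectors stand in for a real embedding model, and simple cosine ranking stands in for the LLM's contextual re-scoring.

```python
# Minimal sketch of the "embed query, retrieve relevant chunks" step.
# Toy term-frequency vectors replace a real embedding model.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, rank chunks by similarity, keep the top k as evidence."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "Carbon nanotubes are grown by chemical vapor deposition at scale.",
    "Graphene oxide is reduced chemically to produce conductive films.",
    "Arc discharge also produces carbon nanotubes in small batches.",
]
top = retrieve("How are carbon nanotubes manufactured?", chunks)
```

In the real system, the retained chunks would then be summarized by an LLM and passed forward as curated, citable evidence for answer generation.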

Quick Start & Requirements

  • Install via pip: pip install "paper-qa>=5" (quote the specifier so the shell does not interpret >=5 as a redirect)
  • Requires Python 3.11+
  • An OpenAI API key (or a key for another LiteLLM-compatible LLM provider) is recommended for the default models.
  • Optional API keys for Crossref and Semantic Scholar to avoid rate limits on metadata services.
  • Quickstart example: cd my_papers && pqa ask 'How can carbon nanotubes be manufactured at a large scale?'
  • Official Docs: https://github.com/Future-House/paper-qa#readme
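The CLI quickstart above has a Python-API equivalent built around a Settings object. The snippet below is a sketch, assuming the v5 `ask` helper, the `paper_directory` field, and the `response.session.answer` attribute path; names may differ between releases, and an OPENAI_API_KEY (or another LiteLLM-compatible provider key) must be set in the environment.

```python
# Usage sketch of the Python API — assumed field/attribute names; requires
# an OPENAI_API_KEY (or another LiteLLM-compatible provider) in the environment.
from paperqa import Settings, ask

response = ask(
    "How can carbon nanotubes be manufactured at a large scale?",
    settings=Settings(paper_directory="my_papers"),  # folder of PDFs/text files
)
print(response.session.answer)  # grounded answer text with in-text citations
```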

Highlighted Details

  • Claims "superhuman performance" on scientific tasks.
  • Integrates with Semantic Scholar, Crossref, and Unpaywall for metadata enrichment.
  • Supports agentic workflows, allowing LLMs to iteratively use tools for search and evidence gathering.
  • Highly configurable with bundled settings for various use cases (e.g., high_quality, fast, contracrow).
  • Supports local embedding models (Sentence Transformers) and locally hosted LLMs (via llama.cpp or Ollama).
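The local-model support in the last bullet is wired up through the same Settings object. This is a configuration sketch only: the field names (`llm`, `summary_llm`, `embedding`), the LiteLLM-style `ollama/` route, and the `st-` Sentence Transformers prefix are assumptions and may differ by release.

```python
# Configuration sketch for fully local operation — assumed field names and
# model strings; adjust to match your installed release.
from paperqa import Settings

local_settings = Settings(
    llm="ollama/llama3",                 # chat model served by a local Ollama instance
    summary_llm="ollama/llama3",         # model used to re-score/summarize chunks
    embedding="st-multi-qa-MiniLM-L6-cos-v1",  # local Sentence Transformers embedding
)
```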

Maintenance & Community

  • Developed by Future-House, with contributions from multiple authors cited in the papers.
  • Active development indicated by the version 5 (PaperQA2) release.
  • Links to relevant papers and citation information are provided.

Licensing & Compatibility

  • The project is released under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Docs objects pickled from prior versions (pre-v5) are incompatible and need rebuilding.
  • While it aims for high accuracy, results can differ from those reported in the published papers, because some internal tools and API access used by the authors are not available in the open-source version.
  • Users must provide their own PDFs; the tool cannot automatically access papers.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 48
  • Issues (30d): 5

Star History

  • 381 stars in the last 90 days
