paperai by neuml

AI for scientific paper analysis and report generation

Created 5 years ago

1,740 stars

Top 24.1% on SourcePulse

2 Experts Love This Project

Gabriel Almeida

Cofounder of Langflow

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This project provides an AI-powered application for semantic search and workflow automation on medical and scientific papers. It targets researchers and data scientists, enabling them to efficiently generate reports and extract insights from large document repositories using LLMs and Retrieval Augmented Generation (RAG).

How It Works

paperai runs a RAG pipeline on top of a txtai embeddings index. At index time, it parses each article into sections, stores the sections with their metadata, and generates embeddings over the entire corpus so the sections can be searched semantically. At query time, it retrieves the most relevant sections, passes them as context to an LLM with a configurable prompt, and generates structured outputs such as reports or annotated PDFs. Because the same query can be run against every paper in the index, this supports bulk LLM inference and automated data extraction from research papers.
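
The retrieve-then-prompt flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not paperai's or txtai's actual code: the word-count vectors stand in for real embeddings, and the names `SECTIONS`, `build_index`, `search`, `build_prompt`, and `PROMPT` are hypothetical. The final step, sending the prompt to an LLM, is left out.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: articles already parsed into sections with metadata.
SECTIONS = [
    {"paper": "A", "section": "Methods",
     "text": "Patients received a daily dose of 10 mg for two weeks."},
    {"paper": "A", "section": "Results",
     "text": "Median hospital stay was five days."},
    {"paper": "B", "section": "Abstract",
     "text": "We study transmission rates in indoor settings."},
]

# Hypothetical prompt template; in a real pipeline this would be configurable.
PROMPT = "Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"


def vectorize(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(sections):
    """Index step: pair every section with its vector."""
    return [(section, vectorize(section["text"])) for section in sections]


def search(index, query, limit=2):
    """Retrieval step: return the sections most similar to the query."""
    q = vectorize(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [section for section, _ in ranked[:limit]]


def build_prompt(index, question, limit=2):
    """Generation step (minus the LLM call): fill the prompt with retrieved context."""
    hits = search(index, question, limit)
    context = "\n".join(f"[{h['paper']}/{h['section']}] {h['text']}" for h in hits)
    return PROMPT.format(context=context, question=question)


index = build_index(SECTIONS)
prompt = build_prompt(index, "What dose did patients receive?")
```

In this sketch the Methods section of paper A ranks first for the dosing question, so its text ends up in the context block that would be handed to the model.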

Quick Start & Requirements

Install via pip: pip install paperai
Requires Python 3.10+.
Docker image available.
See examples for notebooks and applications.

Highlighted Details

Supports bulk LLM inference and report generation in Markdown, CSV, or PDF annotations.
Enables dynamic column generation in reports driven by LLM questions and RAG queries.
Integrates txtai for embeddings and RAG pipelines, with configurable LLM backends.
Can process large datasets of scientific papers for automated research tasks.
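
To make the idea of question-driven report columns concrete, here is a minimal sketch in plain Python. It does not use paperai's report configuration format or API; `COLUMNS`, `PAPERS`, `stub_answer`, `build_rows`, `to_markdown`, and `to_csv` are hypothetical names, and `stub_answer` is a keyword-overlap placeholder for what would be a RAG query plus an LLM call in a real pipeline.

```python
import csv
import io
import re

# Hypothetical column definitions: each report column is driven by a question.
COLUMNS = [
    {"name": "Sample Size", "question": "How many participants were enrolled?"},
    {"name": "Design", "question": "What was the study design?"},
]

# Hypothetical papers, reduced to a title and a short text.
PAPERS = [
    {"title": "Trial A",
     "text": "A randomized controlled design was used. In total 120 participants were enrolled."},
    {"title": "Trial B",
     "text": "An observational cohort design was followed. Researchers enrolled 45 participants."},
]


def words(text):
    """Lowercased word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def stub_answer(question, text):
    """Placeholder for an LLM: return the sentence sharing the most words with the question."""
    sentences = re.split(r"(?<=\.)\s+", text)
    return max(sentences, key=lambda s: len(words(question) & words(s)))


def build_rows(papers, columns, answer):
    """One row per paper, one cell per column question."""
    rows = []
    for paper in papers:
        row = {"Title": paper["title"]}
        for column in columns:
            row[column["name"]] = answer(column["question"], paper["text"])
        rows.append(row)
    return rows


def to_markdown(rows):
    """Render rows as a Markdown table."""
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(row[h] for h in headers) + " |" for row in rows]
    return "\n".join(lines)


def to_csv(rows):
    """Render rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(PAPERS, COLUMNS, stub_answer)
markdown_report = to_markdown(rows)
csv_report = to_csv(rows)
```

Passing the answer function in as an argument keeps the table-building logic independent of the model, so adding a column only means adding another name and question pair.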

Maintenance & Community

Developed by NeuML.
Recognized in articles for its application in COVID-19 research.

Licensing & Compatibility

License: Apache-2.0.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

The annotation feature for PDFs requires the original PDF files to be present and accessible. The project's core functionality relies on the txtai library, and performance may vary based on the chosen LLM and embedding models.

Health Check

Last Commit

7 months ago

Responsiveness

1 day

Star History

25 stars in the last 30 days