paperai  by neuml

AI for scientific paper analysis and report generation

Created 5 years ago
1,469 stars

Top 28.0% on SourcePulse

Project Summary

This project provides an AI-powered application for semantic search and workflow automation over medical and scientific papers. It targets researchers and data scientists who need to generate reports and extract insights from large document repositories using LLMs and retrieval-augmented generation (RAG).

How It Works

paperai runs a RAG pipeline built on txtai embeddings. It indexes articles by parsing them into sections and storing each section with its metadata. Embeddings are generated over the entire corpus, which enables semantic search. When a query is run, the system retrieves the relevant document sections, passes them as context to an LLM with a configurable prompt, and generates structured outputs such as reports or annotated PDFs. This design supports bulk LLM inference and automated data extraction from research papers.
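The retrieve-then-prompt flow described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not paperai's or txtai's API: a bag-of-words cosine similarity stands in for real embeddings, and the names `SECTIONS`, `build_index`, `search`, and `build_prompt` are hypothetical. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: articles already parsed into sections with metadata.
SECTIONS = [
    {"paper": "paper-1", "section": "Methods",
     "text": "Patients were randomized to receive the vaccine or a placebo."},
    {"paper": "paper-1", "section": "Results",
     "text": "Median incubation period was five days."},
    {"paper": "paper-2", "section": "Abstract",
     "text": "We study transformer models for protein folding."},
]

def vectorize(text):
    """Toy stand-in for an embedding: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(sections):
    """Compute one vector per section over the whole corpus."""
    return [(section, vectorize(section["text"])) for section in sections]

def search(index, query, limit=2):
    """Return up to `limit` sections ranked by similarity to the query."""
    query_vec = vectorize(query)
    scored = [(cosine(query_vec, vec), section) for section, vec in index]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [section for _, section in scored[:limit]]

def build_prompt(question, sections,
                 template="Answer using only this context:\n{context}\n\nQuestion: {question}"):
    """Fill a configurable prompt template with the retrieved sections."""
    context = "\n".join(
        f"[{s['paper']} / {s['section']}] {s['text']}" for s in sections
    )
    return template.format(context=context, question=question)

index = build_index(SECTIONS)
hits = search(index, "incubation period")
prompt = build_prompt("What was the incubation period?", hits)
```

In a real deployment the `vectorize` and `search` steps would be handled by an embeddings index, and `prompt` would be passed to whichever LLM backend is configured.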

Quick Start & Requirements

  • Install via pip: pip install paperai
  • Requires Python 3.10+.
  • Docker image available.
  • See examples for notebooks and applications.

Highlighted Details

  • Supports bulk LLM inference and report generation in Markdown, CSV, or PDF annotations.
  • Enables dynamic column generation in reports driven by LLM questions and RAG queries.
  • Integrates txtai for embeddings and RAG pipelines, with configurable LLM backends.
  • Can process large datasets of scientific papers for automated research tasks.
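To make the idea of question-driven report columns concrete, here is a minimal sketch, assuming each column is defined by a name and a question that is answered once per paper. It does not reproduce paperai's configuration format; `PAPERS`, `COLUMNS`, `answer`, `build_rows`, `to_markdown`, and `to_csv` are hypothetical names, and `answer` is a lookup stub standing in for the RAG and LLM step.

```python
import csv
import io

# Hypothetical sample data; "facts" stands in for what an LLM would extract.
PAPERS = [
    {"title": "Paper A", "facts": {"What is the sample size?": "120"}},
    {"title": "Paper B", "facts": {"What is the sample size?": "45",
                                   "What is the study design?": "Cohort"}},
]

# Column name -> question asked of each paper (assumed structure).
COLUMNS = {
    "Sample Size": "What is the sample size?",
    "Design": "What is the study design?",
}

def answer(question, paper):
    """Stub for the RAG + LLM step: returns a canned answer or an empty string."""
    return paper["facts"].get(question, "")

def build_rows(papers, columns):
    """Ask every column's question of every paper and collect one row per paper."""
    rows = []
    for paper in papers:
        row = {"Title": paper["title"]}
        for name, question in columns.items():
            row[name] = answer(question, paper)
        rows.append(row)
    return rows

def to_markdown(rows, fieldnames):
    """Render rows as a Markdown table."""
    lines = ["| " + " | ".join(fieldnames) + " |",
             "| " + " | ".join("---" for _ in fieldnames) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row[name] for name in fieldnames) + " |")
    return "\n".join(lines)

def to_csv(rows, fieldnames):
    """Render rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

fieldnames = ["Title", *COLUMNS]
rows = build_rows(PAPERS, COLUMNS)
markdown_report = to_markdown(rows, fieldnames)
csv_report = to_csv(rows, fieldnames)
```

Adding a column in this sketch only requires adding one entry to `COLUMNS`; the same rows then feed both output formats.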

Maintenance & Community

  • Developed by NeuML.
  • Recognized in articles for its application in COVID-19 research.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The annotation feature for PDFs requires the original PDF files to be present and accessible. The project's core functionality relies on the txtai library, and performance may vary based on the chosen LLM and embedding models.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 30 days
