OpenScholar by AkariAsai

RAG pipeline for scientific literature synthesis

Created 1 year ago

739 stars

Top 47.0% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

OpenScholar addresses the challenge researchers face in synthesizing vast amounts of scientific literature. It provides a retrieval-augmented language model (LM) that answers queries by first searching for relevant papers and then generating responses grounded in them, which helps scientists stay current and find information efficiently.

How It Works

OpenScholar employs a retrieval-augmented generation (RAG) approach. It first retrieves relevant scientific papers using offline indexing or online APIs (Semantic Scholar, You.com). The retrieved passages are then fed into a language model, such as Llama 3.1 8B or proprietary models like GPT-4o, to generate a synthesized answer. Advanced features include reranking passages for relevance and a self-feedback loop for improved generation quality.
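The flow described above (retrieve, rerank, generate, then refine with feedback) can be sketched in a few lines of Python. This is a toy illustration and not OpenScholar's actual code: the word-overlap scorer stands in for a real retriever and reranker, and every name here (Passage, retrieve, rerank, build_prompt, answer, and the generate and critique callbacks) is hypothetical. A real setup would put an LM call behind generate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Passage:
    title: str
    text: str


def overlap_score(query: str, passage: Passage) -> int:
    """Toy relevance score: distinct query words that also occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.text.lower().split())
    return len(query_words & passage_words)


def retrieve(query: str, corpus: List[Passage]) -> List[Passage]:
    """Stand-in retriever: keep passages sharing at least one word with the query."""
    return [p for p in corpus if overlap_score(query, p) > 0]


def rerank(query: str, passages: List[Passage], top_k: int = 3) -> List[Passage]:
    """Stand-in reranker: sort candidates by score and keep the best top_k."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, passages: List[Passage]) -> str:
    """Number the passages so the generated answer can cite them."""
    refs = "\n".join(
        f"[{i}] {p.title}: {p.text}" for i, p in enumerate(passages, start=1)
    )
    return f"References:\n{refs}\n\nQuestion: {query}\nAnswer, citing references by number:"


def answer(
    query: str,
    corpus: List[Passage],
    generate: Callable[[str], str],
    critique: Optional[Callable[[str], str]] = None,
    max_rounds: int = 1,
) -> str:
    """Retrieve, rerank, generate, then optionally revise using feedback."""
    passages = rerank(query, retrieve(query, corpus))
    prompt = build_prompt(query, passages)
    draft = generate(prompt)
    if critique is None:
        return draft
    for _ in range(max_rounds):
        feedback = critique(draft)
        if not feedback:  # empty feedback means the draft is accepted
            break
        draft = generate(
            f"{prompt}\n\nDraft: {draft}\nFeedback: {feedback}\nRevised answer:"
        )
    return draft


if __name__ == "__main__":
    demo_corpus = [
        Passage("Paper A", "retrieval augmented generation for science"),
        Passage("Paper B", "protein folding"),
    ]
    # An echo "model" makes the prompt that would reach the LM visible.
    print(answer("retrieval augmented generation", demo_corpus, generate=lambda p: p))
```

Passing generate and critique as callbacks keeps the sketch independent of any particular model, so the same flow would apply whether an open-weight or a proprietary LM sits behind them.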

Quick Start & Requirements

Installation: run, in order: conda create -n os_env python=3.10.0; conda activate os_env; pip install -r requirements.txt; python -m spacy download en_core_web_sm.
API Keys: Requires S2_API_KEY (Semantic Scholar) and optionally YOUR_API_KEY (You.com).
Dependencies: Python 3.10, spaCy, Conda.
Resources: Training the 8B model requires 8x A100 GPUs. The peS2o retriever requires significant CPU memory for its large index.
Links: Blog, Demo, Paper, Model checkpoints and data, ScholarQABench, OpenScholar_ExpertEval.
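Before running any retrieval that calls the online APIs, it may help to confirm that the keys listed above are set. The following is a minimal sketch: the variable names S2_API_KEY and YOUR_API_KEY come from the summary above, while the missing_keys helper is hypothetical and not part of the repository.

```python
import os
from typing import List, Mapping, Tuple

REQUIRED_KEYS = ("S2_API_KEY",)    # Semantic Scholar
OPTIONAL_KEYS = ("YOUR_API_KEY",)  # You.com, only needed for web search


def missing_keys(env: Mapping[str, str] = os.environ) -> Tuple[List[str], List[str]]:
    """Return (missing required keys, missing optional keys); empty values count as missing."""
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = missing_keys()
    if required:
        print("Missing required keys:", ", ".join(required))
    if optional:
        print("Missing optional keys:", ", ".join(optional))
```

Taking the environment as a parameter makes the check easy to exercise with a plain dictionary instead of the real process environment.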

Highlighted Details

Offers pre-trained models like OpenScholar/Llama-3.1_OpenScholar-8B and OpenScholar/OpenScholar_Reranker.
Supports both open-source (Llama 3.1) and proprietary (GPT-4o) LLMs.
Includes pipelines for Retriever + Reranker and Retriever Self-reflective Generation.
Provides offline retrieval results and plans to release an efficient sparse-dense retriever API.

Maintenance & Community

The project is led by Akari Asai, who notes potential delays in response due to job applications. For demo-related questions, a Google Form is provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The peS2o retriever requires substantial CPU memory because of its large index. The project is under active development, and the planned retriever API has not yet been released. Response times from the primary contact may vary.

Explore Similar Projects

OpenResearcher by GAIR-NLP

FLARE by jzbjyb

Rankify by DataScienceUIBK

HiRAG by hhy-huang

pasa by bytedance

RAG-Survey by Tongji-KGLLM

Local_Pdf_Chat_RAG by weiwill88

local-deep-research by LearningCircuit

pyserini by castorini

paper-qa by Future-House

pdfGPT by bhaskatripathi

local-deep-researcher by langchain-ai