OpenScholar by AkariAsai

RAG pipeline for scientific literature synthesis

created 8 months ago
703 stars

Top 49.6% on sourcepulse

Project Summary

OpenScholar helps researchers synthesize large volumes of scientific literature. It provides a retrieval-augmented language model (LM) that answers a query by first searching for relevant papers and then generating a response grounded in them, so scientists can stay current and find information efficiently.

How It Works

OpenScholar uses a retrieval-augmented generation (RAG) approach. It first retrieves relevant scientific papers, either from an offline index or through online APIs (Semantic Scholar, You.com). The retrieved passages are then passed to a language model, either an open model such as Llama 3.1 8B or a proprietary one such as GPT-4o, which generates a synthesized answer. Two optional stages improve quality: a reranker that reorders the retrieved passages by relevance, and a self-feedback loop in which the model refines its own output.
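
The retrieve, rerank, generate, and self-feedback stages described above can be sketched in a few lines. This is a toy illustration, not the project's code: the names `Passage`, `retrieve`, `rerank`, `generate`, `critique`, and `answer_query` are hypothetical, token overlap stands in for the real retriever and reranker, and the generator is a stub where a real system would call an LM.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    title: str
    text: str


def tokenize(s: str) -> set:
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Toy retriever (assumption): rank by number of tokens shared with the query."""
    q = tokenize(query)
    return sorted(corpus, key=lambda p: len(q & tokenize(p.text)), reverse=True)[:k]


def rerank(query: str, passages: list, k: int = 2) -> list:
    """Toy reranker (assumption): Jaccard similarity instead of a learned model."""
    q = tokenize(query)

    def jaccard(p: Passage) -> float:
        t = tokenize(p.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(passages, key=jaccard, reverse=True)[:k]


def generate(query: str, passages: list) -> str:
    """Stub generator: a real pipeline would prompt an LM with the passages."""
    cites = "; ".join(f"[{i + 1}] {p.title}" for i, p in enumerate(passages))
    return f"Response to '{query}' grounded in: {cites}"


def critique(answer: str, passages: list) -> list:
    """Stub self-feedback: list the passages the draft failed to cite."""
    return [p.title for p in passages if p.title not in answer]


def answer_query(query: str, corpus: list, max_rounds: int = 2) -> str:
    passages = rerank(query, retrieve(query, corpus))
    # The first draft deliberately uses only the top passage so the loop has work to do.
    answer = generate(query, passages[:1])
    for _ in range(max_rounds):
        if not critique(answer, passages):
            break
        answer = generate(query, passages)  # revise using all reranked passages
    return answer
```

Calling `answer_query("retrieval augmented generation", corpus)` on a small list of `Passage` objects returns a string that cites only the passages that survived reranking.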

Quick Start & Requirements

  • Installation: run the following commands in order:
      conda create -n os_env python=3.10.0
      conda activate os_env
      pip install -r requirements.txt
      python -m spacy download en_core_web_sm
  • API Keys: Requires S2_API_KEY (Semantic Scholar) and optionally YOUR_API_KEY (You.com).
  • Dependencies: Python 3.10, spaCy, Conda.
  • Resources: Training the 8B model requires 8x A100 GPUs. The peS2o retriever requires significant CPU memory for its large index.
  • Links: Blog, Demo, Paper, Model checkpoints and data, ScholarQABench, OpenScholar_ExpertEval.
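
Before running anything, it can help to confirm that the keys listed above are available. The sketch below assumes the keys are supplied as environment variables named `S2_API_KEY` and `YOUR_API_KEY`; the helper names `missing_keys` and `build_s2_search_request` are hypothetical. The request builder targets the public Semantic Scholar Graph API paper-search endpoint, which may not be the exact call the project itself makes, and it only constructs the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

REQUIRED_KEYS = ("S2_API_KEY",)    # Semantic Scholar
OPTIONAL_KEYS = ("YOUR_API_KEY",)  # You.com


def missing_keys(env) -> tuple:
    """Return (missing required, missing optional) key names for a mapping like os.environ."""
    required = [k for k in REQUIRED_KEYS if not env.get(k)]
    optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return required, optional


def build_s2_search_request(query: str, api_key: str, limit: int = 5) -> Request:
    """Build (but do not send) a paper-search request with the key in the x-api-key header."""
    params = urlencode({"query": query, "limit": limit, "fields": "title,abstract"})
    url = f"https://api.semanticscholar.org/graph/v1/paper/search?{params}"
    return Request(url, headers={"x-api-key": api_key})
```

A typical use is `missing_keys(os.environ)` at startup, stopping with a clear message if the first list is non-empty and only warning about the second.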

Highlighted Details

  • Offers pre-trained models like OpenScholar/Llama-3.1_OpenScholar-8B and OpenScholar/OpenScholar_Reranker.
  • Supports both open-source (Llama 3.1) and proprietary (GPT-4o) LLMs.
  • Includes pipelines for Retriever + Reranker and Retriever Self-reflective Generation.
  • Provides offline retrieval results and plans to release an efficient sparse-dense retriever API.

Maintenance & Community

The project is led by Akari Asai, who notes potential delays in response due to job applications. For demo-related questions, a Google Form is provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The peS2o retriever requires substantial CPU memory. The project is actively being developed, with plans for future API releases. Response times from the primary contact may vary.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 90 days
