OpenScholar by AkariAsai

RAG pipeline for scientific literature synthesis

Created 10 months ago
721 stars

Top 47.7% on SourcePulse

Project Summary

OpenScholar addresses the challenge of synthesizing vast amounts of scientific literature for researchers. It provides a retrieval-augmented language model (LM) that answers queries by first searching for relevant papers and then generating grounded responses, aiding scientists in staying current and finding information efficiently.

How It Works

OpenScholar employs a retrieval-augmented generation (RAG) approach. It first retrieves relevant scientific papers using offline indexing or online APIs (Semantic Scholar, You.com). The retrieved passages are then fed into a language model, such as Llama 3.1 8B or proprietary models like GPT-4o, to generate a synthesized answer. Advanced features include reranking passages for relevance and a self-feedback loop for improved generation quality.
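The stages described above (retrieve, rerank, generate, then revise using self-feedback) can be sketched in a few lines of Python. This is a minimal illustration, not OpenScholar's actual code: the token-overlap scorer stands in for a real retriever and reranker, and every name here (Passage, retrieve, rerank, answer, fake_generate, fake_critique) is hypothetical. The generate and critique arguments are placeholders for calls to a language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Passage:
    paper_id: str
    text: str


def tokenize(text: str) -> set:
    """Lowercase word set with surrounding punctuation stripped."""
    words = (w.strip(".,;:()?!").lower() for w in text.split())
    return {w for w in words if w}


def overlap_score(query: str, passage: Passage) -> float:
    """Toy relevance score: fraction of query tokens found in the passage."""
    q = tokenize(query)
    return len(q & tokenize(passage.text)) / len(q) if q else 0.0


def retrieve(query: str, corpus: List[Passage]) -> List[Passage]:
    """Stage 1: cheap candidate retrieval (any token shared with the query)."""
    q = tokenize(query)
    return [p for p in corpus if q & tokenize(p.text)]


def rerank(query: str, passages: List[Passage], top_k: int = 2) -> List[Passage]:
    """Stage 2: order candidates by relevance and keep the best top_k."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, passages: List[Passage],
                 feedback: Optional[str] = None) -> str:
    """Put numbered passages in front of the question so answers can cite them."""
    refs = "\n".join(f"[{i}] ({p.paper_id}) {p.text}"
                     for i, p in enumerate(passages, 1))
    prompt = f"References:\n{refs}\n\nQuestion: {query}\nAnswer with citations like [1]."
    if feedback:
        prompt += f"\nRevise the previous answer using this feedback: {feedback}"
    return prompt


def answer(query: str, corpus: List[Passage],
           generate: Callable[[str], str],
           critique: Callable[[str], Optional[str]],
           max_rounds: int = 2) -> str:
    """Stages 3 and 4: generate a draft, then revise until the critic is satisfied."""
    passages = rerank(query, retrieve(query, corpus))
    draft = generate(build_prompt(query, passages))
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:  # the critic has no further complaints
            break
        draft = generate(build_prompt(query, passages, feedback))
    return draft


# Stand-ins for a real language model, used only to exercise the control flow.
def fake_generate(prompt: str) -> str:
    return "Revised answer [1]" if "feedback" in prompt else "Draft answer"


def fake_critique(draft: str) -> Optional[str]:
    return None if "[1]" in draft else "Add a citation."


corpus = [
    Passage("paper-a", "Retrieval augmented generation grounds answers in retrieved passages."),
    Passage("paper-b", "Protein folding is predicted with deep networks."),
    Passage("paper-c", "Reranking retrieved passages improves answer relevance."),
]
query = "How does reranking retrieved passages help generation?"
result = answer(query, corpus, fake_generate, fake_critique)
```

Passing generate and critique as plain callables keeps the control flow independent of which model is used, which mirrors the description above of swapping between an open model and a proprietary one.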

Quick Start & Requirements

  • Installation: run the following in order: conda create -n os_env python=3.10.0; conda activate os_env; pip install -r requirements.txt; python -m spacy download en_core_web_sm.
  • API Keys: Requires S2_API_KEY (Semantic Scholar) and optionally YOUR_API_KEY (You.com).
  • Dependencies: Python 3.10, spaCy, Conda.
  • Resources: Training the 8B model requires 8x A100 GPUs. The peS2o retriever requires significant CPU memory for its large index.
  • Links: Blog, Demo, Paper, Model checkpoints and data, ScholarQABench, OpenScholar_ExpertEval.
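As a rough illustration of how the API keys listed above might be consumed, the sketch below reads S2_API_KEY and the optional YOUR_API_KEY from the environment and builds (without sending) a paper-search request against the public Semantic Scholar Graph API. The helper names load_api_keys and build_s2_search_request are hypothetical, and whether OpenScholar calls this particular endpoint or requests these fields is an assumption, not something stated in the summary.

```python
import os
import urllib.parse
import urllib.request
from typing import Mapping


def load_api_keys(env: Mapping[str, str] = os.environ) -> dict:
    """Read the required Semantic Scholar key and the optional You.com key."""
    s2_key = env.get("S2_API_KEY")
    if not s2_key:
        raise RuntimeError("S2_API_KEY is not set; export it before running retrieval.")
    # YOUR_API_KEY is optional, so a missing value is returned as None.
    return {"s2": s2_key, "you": env.get("YOUR_API_KEY")}


def build_s2_search_request(query: str, api_key: str,
                            limit: int = 5) -> urllib.request.Request:
    """Build a paper-search request; the field list is an illustrative choice."""
    params = urllib.parse.urlencode(
        {"query": query, "limit": limit, "fields": "title,abstract,year"}
    )
    url = f"https://api.semanticscholar.org/graph/v1/paper/search?{params}"
    # Semantic Scholar expects the key in the x-api-key header.
    return urllib.request.Request(url, headers={"x-api-key": api_key})


# Example with an explicit mapping, so no real key is needed to try it out.
keys = load_api_keys({"S2_API_KEY": "demo-key"})
request = build_s2_search_request("retrieval augmented generation", keys["s2"], limit=3)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose, since it needs a valid key and network access.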

Highlighted Details

  • Offers pre-trained models like OpenScholar/Llama-3.1_OpenScholar-8B and OpenScholar/OpenScholar_Reranker.
  • Supports both open-source (Llama 3.1) and proprietary (GPT-4o) LLMs.
  • Includes pipelines for Retriever + Reranker and Retriever Self-reflective Generation.
  • Provides offline retrieval results and plans to release an efficient sparse-dense retriever API.

Maintenance & Community

The project is led by Akari Asai, who notes potential delays in response due to job applications. For demo-related questions, a Google Form is provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The peS2o retriever requires substantial CPU memory. The project is actively being developed, with plans for future API releases. Response times from the primary contact may vary.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days
