RAG pipeline for scientific literature synthesis
OpenScholar addresses the challenge of synthesizing vast amounts of scientific literature for researchers. It provides a retrieval-augmented language model (LM) that answers queries by first searching for relevant papers and then generating grounded responses, aiding scientists in staying current and finding information efficiently.
How It Works
OpenScholar employs a retrieval-augmented generation (RAG) approach. It first retrieves relevant scientific papers using offline indexing or online APIs (Semantic Scholar, You.com). The retrieved passages are then fed into a language model, such as Llama 3.1 8B or proprietary models like GPT-4o, to generate a synthesized answer. Advanced features include reranking passages for relevance and a self-feedback loop for improved generation quality.
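The flow can be sketched in a few lines of Python. This is an illustrative outline, not OpenScholar's actual code: the Semantic Scholar search endpoint is real, but retrieve_passages, rerank, call_lm, and answer_with_feedback are hypothetical helpers standing in for the repository's retriever, reranker, generator, and self-feedback loop.

```python
import os
import requests

# Real endpoint of the Semantic Scholar Graph API; everything else below is a
# simplified stand-in for OpenScholar's components.
S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def retrieve_passages(query: str, limit: int = 10) -> list[str]:
    """Fetch candidate papers from Semantic Scholar and return title + abstract."""
    resp = requests.get(
        S2_SEARCH_URL,
        params={"query": query, "fields": "title,abstract", "limit": limit},
        headers={"x-api-key": os.environ["S2_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    papers = resp.json().get("data", [])
    return [f"{p.get('title', '')}: {p['abstract']}" for p in papers if p.get("abstract")]


def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Stand-in for the cross-encoder reranker; a real one scores query-passage pairs."""
    return passages[:top_k]


def call_lm(prompt: str) -> str:
    """Stand-in for the generator (e.g. Llama 3.1 8B or a proprietary model)."""
    return f"[model output for a {len(prompt)}-character prompt]"


def answer_with_feedback(query: str, passages: list[str]) -> str:
    """Generate a grounded draft, then refine it once using the model's own critique."""
    context = "\n\n".join(passages)
    draft = call_lm(f"Context:\n{context}\n\nAnswer with citations: {query}")
    critique = call_lm(f"List unsupported or missing claims in this draft:\n{draft}")
    return call_lm(f"Revise the draft to address the critique.\n{draft}\n{critique}")


if __name__ == "__main__":
    question = "How do retrieval-augmented LMs help with literature review?"
    top_passages = rerank(question, retrieve_passages(question))
    print(answer_with_feedback(question, top_passages))
```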
Quick Start & Requirements
Set up the environment:
conda create -n os_env python=3.10.0
conda activate os_env
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Set the S2_API_KEY environment variable (Semantic Scholar) and, optionally, YOUR_API_KEY (You.com).
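As a minimal sketch of how these keys might be consumed from Python (the variable names come from the README; the check itself is illustrative):

```python
import os

# S2_API_KEY is required for Semantic Scholar retrieval;
# YOUR_API_KEY is only needed when the You.com backend is used.
s2_key = os.environ.get("S2_API_KEY")
you_key = os.environ.get("YOUR_API_KEY")  # optional

if not s2_key:
    raise RuntimeError("Set S2_API_KEY before running Semantic Scholar retrieval.")
```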
Highlighted Details
Released model checkpoints include OpenScholar/Llama-3.1_OpenScholar-8B and OpenScholar/OpenScholar_Reranker.
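Assuming these names are Hugging Face Hub identifiers (a guess based on the naming convention, not stated in the README), the 8B generator could be loaded with the transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub identifier assumed from the checkpoint name listed above.
model_id = "OpenScholar/Llama-3.1_OpenScholar-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize recent findings on retrieval-augmented generation for science."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```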
Maintenance & Community
The project is led by Akari Asai, who notes potential delays in response due to job applications. For demo-related questions, a Google Form is provided.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The peS2o retriever requires substantial CPU memory. The project is actively being developed, with plans for future API releases. Response times from the primary contact may vary.