OpenResearcher by GAIR-NLP

Scientific research assistant for answering research queries

Created 1 year ago

474 stars

Top 64.3% on SourcePulse

Project Summary

OpenResearcher is an AI-powered scientific research assistant designed to answer queries using the arXiv corpus. It targets researchers and power users seeking accelerated access to the latest scientific insights, offering a competitive alternative to existing RAG systems.

How It Works

OpenResearcher employs a Retrieval-Augmented Generation (RAG) architecture. It leverages both Qdrant for vector search of paper content and Elasticsearch for metadata retrieval. This dual-vector-store approach aims to provide richer and more relevant answers by combining semantic similarity with structured metadata search, outperforming other RAG systems in human and GPT-4 evaluations for correctness, richness, and relevance.

Quick Start & Requirements

Install: Clone the repository, create a conda environment (python=3.10), activate it, cd into the directory, and run pip install -r requirements.txt.
Vector Search: Requires running Qdrant (docker pull qdrant/qdrant and docker run ...) and Elasticsearch (via Docker).
LLMs: Supports OpenAI, Deepseek, Aliyun APIs, and Hugging Face models via vLLM.
Web Search: Requires a Bing Search API key.
Data: Needs arXiv HTML data and metadata downloaded to /data.
Run: Start Qdrant and Elasticsearch retriever servers, then run streamlit run ui_app.py.
Docs: Qdrant Quickstart, Elasticsearch Docker, vLLM.

Highlighted Details

Outperforms Perplexity, iAsk.Ai, You.com, Phind, and Naive RAG in human evaluations for correctness (10 wins vs. 7 losses), richness (25 wins vs. 1 loss), and relevance (15 wins vs. 2 losses).
Supports a wide range of LLMs via OpenAI-compatible APIs and vLLM for open-source models.
Integrates Bing Search for web-based information retrieval.
Requires significant setup involving Docker for vector databases and data indexing pipelines.

Maintenance & Community

The project is associated with GAIR-NLP and has had its paper accepted by EMNLP Demo Track 2024. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup process is complex, requiring the installation and configuration of multiple external services like Qdrant and Elasticsearch. The README does not specify the exact hardware requirements for running the models or indexing the data, which can be substantial.

OpenResearcher by GAIR-NLP

Explore Similar Projects

multimodal-search-r1 by EvolvingLMMs-Lab

FLARE by jzbjyb

nucliadb by nuclia

Rankify by DataScienceUIBK

OpenScholar by AkariAsai

web-explorer by langchain-ai

GraphRAG4OpenWebUI by win4r

rag-search by thinkany-ai

elasticsearch-labs by elastic

local-deep-research by LearningCircuit

paper-qa by Future-House

Perplexica by ItzCrazyKns