OpenResearcher by GAIR-NLP

Scientific research assistant for answering research queries

created 1 year ago
452 stars

Top 67.7% on sourcepulse

Project Summary

OpenResearcher is an AI-powered scientific research assistant designed to answer queries using the arXiv corpus. It targets researchers and power users seeking accelerated access to the latest scientific insights, offering a competitive alternative to existing RAG systems.

How It Works

OpenResearcher employs a Retrieval-Augmented Generation (RAG) architecture. It uses Qdrant for vector search over paper content and Elasticsearch for metadata retrieval. Combining semantic similarity with structured metadata search is intended to produce richer and more relevant answers; the README reports that the system outperforms other RAG systems in human and GPT-4 evaluations of correctness, richness, and relevance.
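
The idea of pairing semantic similarity with structured metadata search can be illustrated with a small, dependency-free sketch. This is not the project's code: the `Paper` fields, the `hybrid_search` function, and the toy two-dimensional embeddings are all hypothetical. In a real deployment, Qdrant and Elasticsearch would take the place of the in-memory list.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Paper:
    # Hypothetical record layout, for illustration only.
    arxiv_id: str
    year: int
    category: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(papers, query_vec, *, category=None, min_year=None, top_k=3):
    """Filter on structured metadata, then rank the survivors by similarity."""
    candidates = [
        p for p in papers
        if (category is None or p.category == category)
        and (min_year is None or p.year >= min_year)
    ]
    ranked = sorted(
        candidates, key=lambda p: cosine(p.embedding, query_vec), reverse=True
    )
    return [p.arxiv_id for p in ranked[:top_k]]


# Toy corpus with made-up identifiers and embeddings.
PAPERS = [
    Paper("paper-a", 2023, "cs.CL", [0.9, 0.1]),
    Paper("paper-b", 2021, "cs.CL", [0.8, 0.2]),
    Paper("paper-c", 2024, "cs.CV", [1.0, 0.0]),
    Paper("paper-d", 2024, "cs.CL", [0.1, 0.9]),
]
```

Calling `hybrid_search(PAPERS, [1.0, 0.0], category="cs.CL", min_year=2022)` restricts the ranking to recent cs.CL papers, whereas omitting the filters ranks the whole corpus by similarity alone.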

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (python=3.10), activate it, cd into the directory, and run pip install -r requirements.txt.
  • Retrieval Backends: Requires running Qdrant for vector search (docker pull qdrant/qdrant and docker run ...) and Elasticsearch for metadata search (via Docker).
  • LLMs: Supports OpenAI, Deepseek, Aliyun APIs, and Hugging Face models via vLLM.
  • Web Search: Requires a Bing Search API key.
  • Data: Needs arXiv HTML data and metadata downloaded to /data.
  • Run: Start the Qdrant and Elasticsearch retriever servers, then launch the UI with streamlit run ui_app.py.
  • Docs: Qdrant Quickstart, Elasticsearch Docker, vLLM.
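
Because the UI depends on both backends being up, a small preflight check before launching can save a confusing startup failure. The sketch below is not part of the repository. It assumes Qdrant and Elasticsearch are reachable on localhost at their default ports (6333 and 9200), which may differ from how your containers or the project's retriever servers are configured. The names `SERVICES`, `is_port_open`, and `unreachable` are hypothetical.

```python
import socket

# Assumed defaults: Qdrant's HTTP port and Elasticsearch's HTTP port.
# Adjust these if your containers map the ports elsewhere.
SERVICES = {
    "qdrant": ("localhost", 6333),
    "elasticsearch": ("localhost", 9200),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(services=SERVICES, probe=is_port_open) -> list[str]:
    """Return the names of services whose port does not accept connections."""
    return [
        name for name, (host, port) in services.items() if not probe(host, port)
    ]
```

Running `unreachable()` returns an empty list when both ports accept connections. An open port only shows that something is listening there, not that the data has been indexed.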

Highlighted Details

  • Reported to outperform Perplexity, iAsk.Ai, You.com, Phind, and Naive RAG in human evaluations of correctness (10 wins vs. 7 losses), richness (25 wins vs. 1 loss), and relevance (15 wins vs. 2 losses).
  • Supports a wide range of LLMs via OpenAI-compatible APIs and vLLM for open-source models.
  • Integrates Bing Search for web-based information retrieval.
  • Requires significant setup involving Docker for vector databases and data indexing pipelines.
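
To put the human-evaluation tallies in the first bullet on a common scale, the snippet below converts each win/loss pair into a share of decided comparisons. It is plain arithmetic on the figures quoted above. Ties are not listed in this summary, so they are left out, and `win_share` is a hypothetical helper name.

```python
# Human-evaluation tallies quoted above as (wins, losses); ties are not listed.
TALLIES = {
    "correctness": (10, 7),
    "richness": (25, 1),
    "relevance": (15, 2),
}


def win_share(wins: int, losses: int) -> float:
    """Percentage of decided comparisons that were wins, to one decimal."""
    return round(100 * wins / (wins + losses), 1)


for metric, (wins, losses) in TALLIES.items():
    print(f"{metric}: {win_share(wins, losses)}% of decided comparisons")
```

Under these assumptions, the shares are about 58.8% for correctness, 96.2% for richness, and 88.2% for relevance, which suggests the margin on correctness is much narrower than on the other two criteria.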

Maintenance & Community

The project is associated with GAIR-NLP, and its paper was accepted to the EMNLP 2024 Demo Track. The README does not detail further community or roadmap information.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup process is complex, requiring the installation and configuration of multiple external services such as Qdrant and Elasticsearch. The README does not specify hardware requirements for running the models or indexing the data, although both can be resource-intensive.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
