OpenResearcher by TIGER-AI-Lab

Agentic LLM for long-horizon deep research trajectory synthesis

Created 3 weeks ago


391 stars

Top 73.6% on SourcePulse

Project Summary

OpenResearcher provides a fully open-source agentic LLM pipeline designed for complex, long-horizon deep research tasks. It targets researchers and power users who want to automate long-horizon deep research workflows, offering a competitive alternative to proprietary models with a transparent and reproducible methodology. The project's key benefits are high benchmark accuracy and a comprehensive open-source recipe that enables community progress.

How It Works

The system is built on a 30B-A3B agentic large language model trained on a 96K-example, high-quality DeepResearch trajectory dataset. A self-built retriever over an 11B-token corpus enables data generation at scale, bypassing the need for costly external search APIs. This keeps training and deployment efficient and low-cost while maintaining leading performance on deep research benchmarks.
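The retrieve-and-reason loop described above can be illustrated with a minimal sketch. Everything here is a hedged assumption for illustration: the function names, the action schema, and the message roles are hypothetical and are not OpenResearcher's actual API; the real system interleaves model calls with its custom retriever in a similar fashion.

```python
# Illustrative sketch of an agentic deep-research loop.
# All names (run_research_agent, action["type"], roles) are hypothetical,
# not taken from the OpenResearcher codebase.

def run_research_agent(question, llm, retriever, max_steps=10):
    """Alternate between LLM reasoning and local corpus retrieval until
    the model emits a final answer, recording every step as a trajectory
    (the kind of record the 96K DeepResearch dataset consists of)."""
    trajectory = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm(trajectory)  # model decides: issue a search or answer
        trajectory.append({"role": "assistant", "content": action["text"]})
        if action["type"] == "answer":
            return action["text"], trajectory
        # A local retriever over the 11B-token corpus stands in for a
        # paid external search API, which is what keeps data generation cheap.
        docs = retriever(action["query"], top_k=5)
        trajectory.append({"role": "tool", "content": "\n".join(docs)})
    return None, trajectory  # step budget exhausted without an answer
```

The key design point the sketch captures is that every intermediate search and observation is kept in the trajectory, so completed runs can be harvested as training data.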

Quick Start & Requirements

  • Primary Install: Clone the repository and install dependencies with uv inside a Python 3.12 virtual environment. Key commands: uv venv --python 3.12; source .venv/bin/activate; uv pip install -e . (an editable install). Installing Tevatron is required for BrowseComp-Plus.
  • Prerequisites: Requires Linux, Python 3.12, and OpenJDK 21. Substantial GPU resources are noted, specifically 8× NVIDIA A100 80GB GPUs, though other setups may function with parameter adjustments. API keys for Serper and OpenAI are necessary for specific functionalities.
  • Links:
    • 🤗 HuggingFace
    • Blog
    • Slack
    • Official docs/demos: not explicitly linked, but news mentions a demo video and NVIDIA NeMo Data Designer integration.
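The install steps above can be sketched as a shell session. This assumes a Linux machine with git and uv already available, and the repository URL is inferred from the project and organization names rather than stated in the summary.

```shell
# Clone the repository (URL assumed from the TIGER-AI-Lab org name)
git clone https://github.com/TIGER-AI-Lab/OpenResearcher.git
cd OpenResearcher

# Create and activate a Python 3.12 virtual environment with uv
uv venv --python 3.12
source .venv/bin/activate

# Install the project and its dependencies in editable mode
uv pip install -e .
```

For BrowseComp-Plus, Tevatron must additionally be installed per the project's instructions; Serper and OpenAI API keys are configured separately.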

Highlighted Details

  • Achieves 54.8% accuracy on BrowseComp-Plus, surpassing GPT-4.1, Claude-Opus-4, and Gemini-2.5-Pro.
  • Fully open-source: includes the 96K DeepResearch trajectory dataset, the 30B-A3B model, and the complete training/evaluation recipe.
  • Scalable, low-cost data generation via a custom retriever, eliminating reliance on external search APIs.
  • Demonstrates leading performance across benchmarks including BrowseComp-Plus, BrowseComp, GAIA, and xbench-DeepSearch.

Maintenance & Community

Core contributors include Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, and Ping Nie, with advisors Wenhu Chen and Yu Zhang. Support for GPU and API resources has been provided by Lambda, Netmind AI, Verdent AI, and Serper. Community engagement is facilitated via Slack and WeChat. Contributions and feedback are welcomed via issues, pull requests, or email.

Licensing & Compatibility

  • License: The specific open-source license is not detailed in the provided README.
  • Compatibility: No explicit notes on commercial use or closed-source linking are present.

Limitations & Caveats

The project has substantial hardware requirements, specifically citing 8x A100 80G GPUs, which may be an adoption barrier. The absence of a clearly stated software license in the README is a significant omission for due diligence and adoption decisions. Setup involves multiple steps and external API key configurations.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 392 stars in the last 23 days
