OpenResearcher by TIGER-AI-Lab

Agentic LLM for long-horizon deep research trajectory synthesis

Created 3 weeks ago


391 stars

Top 73.6% on SourcePulse

Project Summary

OpenResearcher provides a fully open-source agentic LLM pipeline designed for complex, long-horizon deep research tasks. It targets researchers and power users who want to automate long-horizon deep research workflows, offering a competitive alternative to proprietary models with a transparent and reproducible methodology. The project's key benefits are high benchmark accuracy and a comprehensive open-source recipe that enables community progress.

How It Works

The system is built on a 30B-A3B agentic large language model trained on a 96K-example, high-quality DeepResearch trajectory dataset. A self-built retriever over an 11B-token corpus enables data generation at scale, bypassing the need for costly external search APIs. This keeps training and deployment efficient and low-cost while maintaining leading performance on deep research benchmarks.
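The retrieve-and-reason loop described above can be illustrated with a minimal sketch. Everything here is a hedged assumption for illustration: the function names, the action schema, and the message roles are hypothetical and are not OpenResearcher's actual API; the real system interleaves model calls with its custom retriever in a similar fashion.

```python
# Illustrative sketch of an agentic deep-research loop.
# All names (run_research_agent, action["type"], roles) are hypothetical,
# not taken from the OpenResearcher codebase.

def run_research_agent(question, llm, retriever, max_steps=10):
    """Alternate between LLM reasoning and local corpus retrieval until
    the model emits a final answer, recording every step as a trajectory
    (the kind of record the 96K DeepResearch dataset consists of)."""
    trajectory = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm(trajectory)  # model decides: issue a search or answer
        trajectory.append({"role": "assistant", "content": action["text"]})
        if action["type"] == "answer":
            return action["text"], trajectory
        # A local retriever over the 11B-token corpus stands in for a
        # paid external search API, which is what keeps data generation cheap.
        docs = retriever(action["query"], top_k=5)
        trajectory.append({"role": "tool", "content": "\n".join(docs)})
    return None, trajectory  # step budget exhausted without an answer
```

The key design point the sketch captures is that every intermediate search and observation is kept in the trajectory, so completed runs can be harvested as training data.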

Quick Start & Requirements

  • Primary Install: Clone the repository and install dependencies with uv inside a Python 3.12 virtual environment. Key commands: uv venv --python 3.12; source .venv/bin/activate; uv pip install -e . (an editable install). Installing Tevatron is required for BrowseComp-Plus.
  • Prerequisites: Requires Linux, Python 3.12, and OpenJDK 21. Substantial GPU resources are noted, specifically 8× NVIDIA A100 80GB GPUs, though other setups may function with parameter adjustments. API keys for Serper and OpenAI are necessary for specific functionalities.
  • Links:
    • 🤗 HuggingFace
    • Blog
    • Slack
    • Official docs/demos: not explicitly linked, but news mentions a demo video and NVIDIA NeMo Data Designer integration.
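The install steps above can be sketched as a shell session. This assumes a Linux machine with git and uv already available, and the repository URL is inferred from the project and organization names rather than stated in the summary.

```shell
# Clone the repository (URL assumed from the TIGER-AI-Lab org name)
git clone https://github.com/TIGER-AI-Lab/OpenResearcher.git
cd OpenResearcher

# Create and activate a Python 3.12 virtual environment with uv
uv venv --python 3.12
source .venv/bin/activate

# Install the project and its dependencies in editable mode
uv pip install -e .
```

For BrowseComp-Plus, Tevatron must additionally be installed per the project's instructions; Serper and OpenAI API keys are configured separately.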

Highlighted Details

  • Achieves 54.8% accuracy on BrowseComp-Plus, surpassing GPT-4.1, Claude-Opus-4, and Gemini-2.5-Pro.
  • Fully open-source: includes the 96K DeepResearch trajectory dataset, the 30B-A3B model, and the complete training/evaluation recipe.
  • Scalable, low-cost data generation via a custom retriever, eliminating reliance on external search APIs.
  • Demonstrates leading performance across benchmarks including BrowseComp-Plus, BrowseComp, GAIA, and xbench-DeepSearch.

Maintenance & Community

Core contributors include Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, and Ping Nie, with advisors Wenhu Chen and Yu Zhang. Support for GPU and API resources has been provided by Lambda, Netmind AI, Verdent AI, and Serper. Community engagement is facilitated via Slack and WeChat. Contributions and feedback are welcomed via issues, pull requests, or email.

Licensing & Compatibility

  • License: The specific open-source license is not detailed in the provided README.
  • Compatibility: No explicit notes on commercial use or closed-source linking are present.

Limitations & Caveats

The project has substantial hardware requirements, specifically citing 8x A100 80G GPUs, which may be an adoption barrier. The absence of a clearly stated software license in the README is a significant omission for due diligence and adoption decisions. Setup involves multiple steps and external API key configurations.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 392 stars in the last 23 days
