deep_research_bench by Ayanami0730

Benchmark for evaluating deep research agents

Created 8 months ago

596 stars

Top 54.7% on SourcePulse

1 Expert Loves This Project

Rotem Weiss

Cofounder of Tavily

Project Summary

DeepResearch Bench provides a comprehensive evaluation framework for Deep Research Agents (DRAs), addressing the need for systematic assessment of their capabilities across diverse academic and professional domains. It comprises 100 PhD-level tasks across 22 fields, designed to reflect real-world research demands and challenge advanced AI agents.

How It Works

The benchmark employs two primary evaluation methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation) and FACT (Framework for Factual Abundance and Citation Trustworthiness). RACE assesses report quality across comprehensiveness, insight, instruction-following, and readability using dynamic, task-specific criteria and reference reports. FACT evaluates information retrieval and grounding by extracting factual claims and their cited URLs, verifying support, and calculating citation accuracy and effective citation counts.
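The two FACT metrics named above can be illustrated with a minimal sketch. It assumes that citation accuracy is the share of extracted statement-URL pairs judged as supported, and that the effective citation count is the average number of supported pairs per task. Those definitions, the `CitedClaim` type, and both function names are hypothetical illustrations, not taken from the repository's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CitedClaim:
    """One factual statement extracted from a report, with its cited URL (hypothetical type)."""
    statement: str
    url: str
    supported: bool  # judge verdict: does the cited page support the statement?


def citation_accuracy(claims: List[CitedClaim]) -> float:
    """Share of statement-URL pairs judged as supported (assumed definition)."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)


def effective_citations(claims_per_task: List[List[CitedClaim]]) -> float:
    """Average number of supported statement-URL pairs per task (assumed definition)."""
    if not claims_per_task:
        return 0.0
    total_supported = sum(
        sum(c.supported for c in claims) for claims in claims_per_task
    )
    return total_supported / len(claims_per_task)


# Made-up verdicts for two tasks, for illustration only.
task_a = [
    CitedClaim("claim 1", "https://example.com/a", True),
    CitedClaim("claim 2", "https://example.com/b", False),
    CitedClaim("claim 3", "https://example.com/c", True),
]
task_b = [CitedClaim("claim 4", "https://example.com/d", True)]

print(citation_accuracy(task_a))
print(effective_citations([task_a, task_b]))
```

In this sketch the extraction and verification steps (web scraping plus an LLM judge) are reduced to the precomputed `supported` flag; only the final aggregation is shown.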

Quick Start & Requirements

Installation: Clone the repository, navigate to the directory, and install dependencies using pip install -r requirements.txt.
Prerequisites: Python 3.9+, Gemini API key, Jina API key. API keys must be set as environment variables (GEMINI_API_KEY, JINA_API_KEY).
Setup: Beyond installing dependencies, setup consists mainly of configuring the two API keys.
Documentation: The README provides a detailed quick start flow and project structure.
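Before running an evaluation, it may help to confirm that both environment variables listed above are set. The following is a minimal sketch; the helper name `missing_api_keys` is hypothetical and not part of the repository.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names as listed in the prerequisites.
REQUIRED_KEYS = ("GEMINI_API_KEY", "JINA_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.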

Highlighted Details

Evaluates 100 PhD-level research tasks across 22 domains, with the domain distribution balanced according to an analysis of real-world queries.
RACE evaluation uses dynamic criteria and reference reports for multi-dimensional quality assessment.
FACT evaluation focuses on factual accuracy and citation trustworthiness through web scraping and LLM judgment.
Leaderboard and raw data are available on Hugging Face; the project also lists partnerships with AGI-Eval and Nvidia-AIQ-Research.
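The RACE idea of scoring a report on several weighted dimensions relative to a reference report can be sketched as follows. This is an assumption-laden illustration: the task-specific weights, the per-dimension scores, the normalization `target / (target + reference)`, and all function names are hypothetical and may differ from what the benchmark actually computes.

```python
from typing import Dict

# The four report-quality dimensions named in the RACE description.
DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores using task-specific weights (assumed aggregation)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def relative_score(
    target: Dict[str, float],
    reference: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Score the target report relative to the reference report (assumed normalization)."""
    t = weighted_score(target, weights)
    r = weighted_score(reference, weights)
    return t / (t + r) if (t + r) > 0 else 0.0


# Made-up judge scores and weights, for illustration only.
weights = {d: 0.25 for d in DIMENSIONS}
reference = {d: 8.0 for d in DIMENSIONS}
target = {d: 8.0 for d in DIMENSIONS}

print(relative_score(target, reference, weights))
```

Under this normalization, a report judged equal to the reference on every dimension lands at the midpoint of the scale, and scores above that indicate a report judged stronger than the reference.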

Maintenance & Community

The project is actively maintained. Recent updates include evaluations of Kimi-Researcher, Doubao-DeepResearch, and Claude-Researcher, as well as infrastructure upgrades. For leaderboard ranking inquiries, the maintainers can be contacted by email.

Licensing & Compatibility

The provided README does not state a license. The citation format suggests the project is intended for academic research.

Limitations & Caveats

FACT evaluation results should be interpreted with caution due to potential variations in web scraping capabilities between Jina AI and internal systems used by some companies. The benchmark requires specific API keys for Gemini and Jina, which may incur costs.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

44 stars in the last 30 days