Ayanami0730: Benchmark for evaluating deep research agents
DeepResearch Bench provides a comprehensive evaluation framework for Deep Research Agents (DRAs), addressing the need for systematic assessment of their capabilities across diverse academic and professional domains. It comprises 100 PhD-level tasks across 22 fields, designed to reflect real-world research demands and challenge advanced AI agents.
How It Works
The benchmark employs two primary evaluation methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation) and FACT (Framework for Factual Abundance and Citation Trustworthiness). RACE assesses report quality across comprehensiveness, insight, instruction-following, and readability using dynamic, task-specific criteria and reference reports. FACT evaluates information retrieval and grounding by extracting factual claims and their cited URLs, verifying support, and calculating citation accuracy and effective citation counts.
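The FACT pipeline described above can be sketched as a simple scoring step. This is a hypothetical illustration of how citation accuracy and effective citation counts could be computed from verified claims; the `Claim` structure and function names are assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of FACT-style citation scoring.
# Assumes claims have already been extracted and verified against their cited URLs.
from dataclasses import dataclass


@dataclass
class Claim:
    text: str        # the factual statement extracted from the report
    url: str         # the URL cited for the statement
    supported: bool  # whether the cited page actually supports the statement


def citation_metrics(claims: list[Claim]) -> dict[str, float]:
    """Citation accuracy = supported claims / cited claims;
    effective citations = count of supported claims."""
    if not claims:
        return {"citation_accuracy": 0.0, "effective_citations": 0}
    supported = sum(c.supported for c in claims)
    return {
        "citation_accuracy": supported / len(claims),
        "effective_citations": supported,
    }


claims = [
    Claim("Fact A", "https://example.com/a", True),
    Claim("Fact B", "https://example.com/b", False),
    Claim("Fact C", "https://example.com/c", True),
]
print(citation_metrics(claims))
```

In this toy example, two of three claims are supported, giving a citation accuracy of about 0.67 and two effective citations.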
Quick Start & Requirements
Install dependencies with `pip install -r requirements.txt`, then set the required API keys (GEMINI_API_KEY, JINA_API_KEY) as environment variables.
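A minimal setup sketch for the steps above; the key values are placeholders, and the exact environment-variable workflow is an assumption based on the variable names in the README.

```shell
# Install the benchmark's Python dependencies (run from the repo root).
pip install -r requirements.txt

# Export the required API keys; replace the placeholders with real keys.
export GEMINI_API_KEY="your-gemini-key"
export JINA_API_KEY="your-jina-key"
```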
Maintenance & Community
The project is actively maintained, with recent updates including evaluations of Kimi-Researcher, Doubao-DeepResearch, and Claude-Researcher, and infrastructure upgrades. Contact is available via email for leaderboard ranking inquiries.
Licensing & Compatibility
The repository's README does not specify a license. The citation format suggests the benchmark is intended for academic research.
Limitations & Caveats
FACT evaluation results should be interpreted with caution due to potential variations in web scraping capabilities between Jina AI and internal systems used by some companies. The benchmark requires specific API keys for Gemini and Jina, which may incur costs.