texttron: Benchmarking deep-research agents for fair, transparent evaluation
Top 98.8% on SourcePulse
BrowseComp-Plus is a new benchmark for fair, reproducible evaluation of deep-research agents, built on a fixed, curated corpus. It lets researchers and engineers isolate the impact of the retrieval component from that of the LLM agent, which makes comparisons transparent and supports systematic analysis of agent performance.
How It Works
This benchmark leverages reasoning-intensive queries from OpenAI's BrowseComp, but evaluates them against a static, ~100K document corpus. This controlled environment eliminates variability from live web searches, allowing for precise isolation and comparison of retriever effectiveness and LLM agent capabilities. The approach facilitates systematic studies on how different retrievers interact with identical LLM agents.
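The evaluation design described above can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the three-document toy corpus, the two retrievers, the single agent, and the substring-match scoring are assumptions, not part of the BrowseComp-Plus codebase. The point is only that, with the corpus held fixed, any change in accuracy can be attributed to the retriever or the agent being swapped.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Toy stand-in for the fixed corpus (the real one holds roughly 100K documents).
CORPUS: Dict[str, str] = {
    "d1": "The Eiffel Tower was completed in 1889 in Paris.",
    "d2": "Mount Everest is the highest mountain above sea level.",
    "d3": "The Louvre in Paris houses the Mona Lisa.",
}

# Hypothetical queries with gold answers.
QUERIES = [
    {"question": "When was the Eiffel Tower completed?", "answer": "1889"},
    {"question": "Which museum houses the Mona Lisa?", "answer": "Louvre"},
]

Retriever = Callable[[str, int], List[str]]
Agent = Callable[[str, List[str]], str]


def _tokens(text: str) -> set:
    """Lowercase, strip trailing punctuation, and split on whitespace."""
    return set(text.lower().rstrip("?.").split())


def keyword_retriever(query: str, k: int) -> List[str]:
    """Rank documents by word overlap with the query (hypothetical retriever)."""
    q = _tokens(query)
    ranked = sorted(CORPUS, key=lambda d: -len(q & _tokens(CORPUS[d])))
    return ranked[:k]


def first_k_retriever(query: str, k: int) -> List[str]:
    """Weak baseline: ignores the query and returns the first k document ids."""
    return list(CORPUS)[:k]


def extractive_agent(question: str, doc_ids: List[str]) -> str:
    """Toy agent: returns the retrieved text itself as its answer."""
    return " ".join(CORPUS[d] for d in doc_ids)


def evaluate(retriever: Retriever, agent: Agent, k: int = 1) -> float:
    """Fraction of queries whose gold answer appears in the agent output."""
    hits = sum(
        q["answer"] in agent(q["question"], retriever(q["question"], k))
        for q in QUERIES
    )
    return hits / len(QUERIES)


def run_grid(
    retrievers: Dict[str, Retriever], agents: Dict[str, Agent], k: int = 1
) -> Dict[Tuple[str, str], float]:
    """Score every retriever/agent pair against the same fixed corpus."""
    return {
        (r_name, a_name): evaluate(r_fn, a_fn, k)
        for (r_name, r_fn), (a_name, a_fn) in product(retrievers.items(), agents.items())
    }


results = run_grid(
    {"keyword": keyword_retriever, "first_k": first_k_retriever},
    {"extractive": extractive_agent},
)
for pair, accuracy in results.items():
    print(pair, accuracy)
```

Because both retrievers feed the same agent over the same documents, the gap between the two scores in this toy setup reflects retrieval quality alone.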
Quick Start & Requirements
Installation: uv is recommended. Install it with curl -LsSf https://astral.sh/uv/install.sh | sh, then run uv sync, source .venv/bin/activate, and uv pip install --no-build-isolation flash-attn.
Prerequisites: flash-attn (for FAISS) and the datasets library. Hugging Face CLI login may be required for dataset access.
Dataset: decrypt with python scripts_build_index/decrypt_dataset.py. The corpus is loaded via datasets.load_dataset("Tevatron/browsecomp-plus-corpus").
Indexes: download with bash scripts_build_index/download_indexes.sh.
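The corpus-loading step can be sketched in Python as follows. The repository id comes from the text above; the `load_corpus` helper name and the `"train"` split are assumptions made for illustration, so check the dataset page on the Hugging Face Hub for the actual split and column names.

```python
def load_corpus(repo_id: str = "Tevatron/browsecomp-plus-corpus", split: str = "train"):
    """Load the fixed corpus from the Hugging Face Hub.

    Assumption: the corpus exposes a "train" split; pass a different
    split name if the Hub page lists one.
    """
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


# Example (needs network access and possibly a prior Hugging Face CLI login):
#   corpus = load_corpus()
#   print(corpus)               # row count and features
#   print(corpus.column_names)  # inspect the fields before relying on any of them
```

Inspecting the column names first avoids hard-coding field names that the source text does not specify.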
Maintenance & Community
Direct contact points are provided: Zijian Chen (s42chen@uwaterloo.ca), Xueguang Ma (x93ma@uwaterloo.ca), and Shengyao Zhuang (s.zhuang@uq.edu.au). No community channels (e.g., Discord, Slack) or explicit roadmap links are mentioned in the provided text.
Licensing & Compatibility
The provided README text does not specify a software license. This omission requires clarification before adoption, especially concerning commercial use or integration into closed-source projects.
Limitations & Caveats
Reproducing results for proprietary models can be expensive: evaluating all queries with frontier models is estimated to cost around $1,000 USD. The absence of explicit licensing information is a significant caveat for potential adopters.