BrowseComp-Plus by texttron

Benchmarking deep-research agents for fair, transparent evaluation

Created 8 months ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
Project Summary

BrowseComp-Plus is a new benchmark that addresses the challenge of fair, reproducible evaluation of Deep-Research agents by using a fixed, curated corpus. It lets researchers and engineers isolate the impact of retrieval components and LLM agents, enabling transparent comparisons and systematic analysis of agent performance.

How It Works

This benchmark takes the reasoning-intensive queries from OpenAI's BrowseComp but evaluates them against a static corpus of roughly 100K documents. The controlled environment removes the variability of live web search, so retriever effectiveness and LLM agent capability can be isolated and compared precisely. It also enables systematic studies of how different retrievers interact with the same LLM agent.
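
As a rough sketch of why a frozen corpus matters, the toy Python below ranks documents from the fixed corpus with a naive keyword scorer. The "train" split name, the "docid"/"text" field names, and the scorer itself are assumptions for illustration, not the repository's actual interfaces.

    from datasets import load_dataset

    # Load the fixed ~100K-document corpus named in this summary
    # (the "train" split name is an assumption).
    corpus = load_dataset("Tevatron/browsecomp-plus-corpus", split="train")

    def keyword_retriever(query, k=5):
        # Toy stand-in for BM25/dense retrieval: rank a small slice of the
        # corpus by query-term overlap. Real runs would use the repo's indexes.
        terms = set(query.lower().split())
        sample = corpus.select(range(min(1000, len(corpus))))
        scored = sorted(
            sample,
            key=lambda doc: sum(t in doc["text"].lower() for t in terms),
            reverse=True,
        )
        return scored[:k]

    # Because the corpus never changes between runs, swapping this function
    # for another retriever isolates the retriever's contribution: the agent
    # and the document pool stay constant.
    for doc in keyword_retriever("deep research agents"):
        print(doc["docid"], doc["text"][:80])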

Quick Start & Requirements

  • Installation: Environment management via uv is recommended (curl -LsSf https://astral.sh/uv/install.sh | sh, then uv sync, source .venv/bin/activate, uv pip install --no-build-isolation flash-attn).
  • Prerequisites: Python 3.10+, Java 21 (via Conda or apt), flash-attn (to accelerate the local embedding models), and the datasets library. A Hugging Face CLI login may be required for dataset access.
  • Dataset: Decrypt the dataset with python scripts_build_index/decrypt_dataset.py; the corpus is then loaded via datasets.load_dataset("Tevatron/browsecomp-plus-corpus") (see the access sketch after this list).
  • Pre-built Indexes: Download BM25 and Qwen3-Embedding indexes using bash scripts_build_index/download_indexes.sh.
  • Resources: links to the 🤗 Dataset, 🏆 Leaderboard, 📄 Paper, and 🔍 Project Page are provided.
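
If the Hub dataset is gated, a programmatic login before loading may be needed. A minimal sketch (the gating itself and the "train" split name are assumptions):

    from huggingface_hub import login
    from datasets import load_dataset

    # Interactive login; equivalently pass a token: login(token="hf_...").
    login()

    # Corpus identifier as given in this summary; split name assumed.
    corpus = load_dataset("Tevatron/browsecomp-plus-corpus", split="train")
    print(len(corpus), "documents")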

Highlighted Details

  • Employs a fixed corpus of ~100K human-verified documents for controlled retrieval.
  • Provides scripts for reproducing results from major LLM providers (OpenAI, Anthropic, Gemini, Qwen).
  • Supports evaluation of both full Deep-Research agents and retrieval-only components, the latter via the standard TREC run format (a minimal example follows this list).
  • Offers downloadable execution trajectory data for expensive baseline models to reduce research barriers.
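
For the retrieval-only track, results are exchanged in the standard TREC run format, one line per hit: qid, the literal "Q0", docid, 1-based rank, score, run tag. A minimal writer with hypothetical results, not the repository's evaluation code:

    # Hypothetical retrieval results: query id -> ranked (docid, score) pairs.
    results = {
        "query-1": [("doc-42", 12.3), ("doc-7", 9.8)],
    }

    with open("run.example.trec", "w") as f:
        for qid, hits in results.items():
            for rank, (docid, score) in enumerate(hits, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score} example-run\n")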

Maintenance & Community

Direct contact points are provided: Zijian Chen (s42chen@uwaterloo.ca), Xueguang Ma (x93ma@uwaterloo.ca), and Shengyao Zhuang (s.zhuang@uq.edu.au). No community channels (e.g., Discord, Slack) or explicit roadmap links are mentioned in the provided text.

Licensing & Compatibility

The provided README text does not specify a software license. This omission requires clarification before adoption, especially concerning commercial use or integration into closed-source projects.

Limitations & Caveats

Reproducing results for proprietary models can be expensive: evaluating all queries with a frontier model is estimated at roughly $1,000 USD. The absence of explicit licensing information is a further caveat for potential adopters.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 30 days
