BrowseComp-Plus by texttron

Benchmarking deep-research agents for fair, transparent evaluation

Created 8 months ago

255 stars

Top 98.8% on SourcePulse

1 Expert Loves This Project

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

BrowseComp-Plus is a benchmark for fair, reproducible evaluation of Deep-Research agents, built on a fixed, curated corpus. Because every system searches the same documents, researchers and engineers can separate the impact of the retrieval component from that of the LLM agent, which supports transparent comparisons and systematic analysis of agent performance.

How It Works

This benchmark leverages reasoning-intensive queries from OpenAI's BrowseComp, but evaluates them against a static, ~100K document corpus. This controlled environment eliminates variability from live web searches, allowing for precise isolation and comparison of retriever effectiveness and LLM agent capabilities. The approach facilitates systematic studies on how different retrievers interact with identical LLM agents.
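The idea of holding the corpus and agent fixed while swapping the retriever can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the project: the three-document corpus, the two retrievers, and toy_agent (a stand-in for an LLM agent that simply answers with the last word of the top document). It only shows why a fixed corpus lets a score difference be attributed to the retriever.

```python
# Toy fixed corpus: every retriever searches exactly these documents.
CORPUS = {
    "d1": "the eiffel tower is in paris",
    "d2": "mount fuji is in japan",
    "d3": "paris is the capital of france",
}

# Hypothetical question/answer pairs used for scoring.
QA_PAIRS = [
    ("where is mount fuji", "japan"),
    ("where is the eiffel tower", "paris"),
]


def overlap_retriever(query, k):
    """Rank documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), docid)
              for docid, text in CORPUS.items()]
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docid for _, docid in scored[:k]]


def first_k_retriever(query, k):
    """Weak baseline: ignore the query and return the first k document ids."""
    return sorted(CORPUS)[:k]


def toy_agent(query, retrieve):
    """Stand-in for an LLM agent: answer with the last word of the top document."""
    top_docid = retrieve(query, 1)[0]
    return CORPUS[top_docid].split()[-1]


def accuracy(retrieve, qa_pairs):
    """Fraction of questions the same agent answers correctly with this retriever."""
    correct = sum(toy_agent(question, retrieve) == answer
                  for question, answer in qa_pairs)
    return correct / len(qa_pairs)


if __name__ == "__main__":
    for name, retriever in [("overlap", overlap_retriever),
                            ("first-k", first_k_retriever)]:
        print(name, accuracy(retriever, QA_PAIRS))
```

Because the agent and the corpus are identical in both runs, any gap between the two accuracy values comes from the retriever alone.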

Quick Start & Requirements

Installation: uv is recommended for environment management. Install it with curl -LsSf https://astral.sh/uv/install.sh | sh, then run uv sync, source .venv/bin/activate, and uv pip install --no-build-isolation flash-attn.
Prerequisites: Python 3.10+, Java 21 (via Conda or apt), flash-attn (for FAISS), and the Hugging Face datasets library. Logging in with the Hugging Face CLI may be required for dataset access.
Dataset: Download and decrypt the dataset with python scripts_build_index/decrypt_dataset.py. The corpus is loaded with datasets.load_dataset("Tevatron/browsecomp-plus-corpus").
Pre-built Indexes: Download BM25 and Qwen3-Embedding indexes using bash scripts_build_index/download_indexes.sh.
Resources: Links to 🤗Dataset, 🏆Leaderboard, 📄Paper, 🔍Project Page are provided.
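The corpus-loading step above can be sketched in Python. This assumes the datasets library is installed and that your account has access to the Tevatron/browsecomp-plus-corpus repository on Hugging Face. The helper names load_corpus and split_sizes are illustrative and not part of the project. The sketch avoids assuming any split or column names and only reports what the loaded object contains.

```python
def load_corpus(repo_id="Tevatron/browsecomp-plus-corpus"):
    """Load the corpus from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the helper below can be used without the library.
    from datasets import load_dataset
    return load_dataset(repo_id)


def split_sizes(dataset_dict):
    """Map each split name to its number of records.

    Works on any mapping of name -> sized collection, which includes
    the DatasetDict returned when load_dataset is called without a split.
    """
    return {name: len(split) for name, split in dataset_dict.items()}


if __name__ == "__main__":
    corpus = load_corpus()
    # Inspect split names and sizes before relying on any of them.
    print(split_sizes(corpus))
```

Printing the split sizes first is a cheap way to confirm that the download succeeded and to see which splits exist before writing code against them.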

Highlighted Details

Employs a fixed corpus of ~100K human-verified documents for controlled retrieval.
Provides scripts for reproducing results from major LLM providers (OpenAI, Anthropic, Gemini, Qwen).
Supports evaluation for both full Deep-Research agents and retrieval-only components using TREC format.
Offers downloadable execution trajectory data for expensive baseline models to reduce research barriers.
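For the retrieval-only evaluation mentioned above, a standard TREC run file has one line per retrieved document with six whitespace-separated fields: query id, the literal Q0, document id, rank, score, and a run tag. The sketch below writes such lines and computes recall@k against relevance judgments. The function names, sample ids, and qrels are hypothetical, and recall@k is shown as a common retrieval metric, not necessarily the one the benchmark reports.

```python
from collections import defaultdict


def format_trec_run(results, run_tag="my-retriever"):
    """Turn {query_id: [(doc_id, score), ...]} into TREC run lines."""
    lines = []
    for qid, scored_docs in results.items():
        # Rank 1 is the highest-scoring document for each query.
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {run_tag}")
    return lines


def recall_at_k(run_lines, qrels, k):
    """Mean over queries of the fraction of relevant documents ranked in the top k.

    qrels maps query_id -> set of relevant doc ids; queries with no
    relevant documents are skipped.
    """
    retrieved = defaultdict(list)
    for line in run_lines:
        qid, _q0, docid, rank, _score, _tag = line.split()
        if int(rank) <= k:
            retrieved[qid].append(docid)

    per_query = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        hits = len(set(retrieved[qid]) & relevant)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


if __name__ == "__main__":
    sample_results = {"q1": [("d3", 0.2), ("d1", 0.9)], "q2": [("d2", 0.5)]}
    sample_qrels = {"q1": {"d1", "d3"}, "q2": {"d9"}}
    run = format_trec_run(sample_results, run_tag="demo")
    print("\n".join(run))
    print(recall_at_k(run, sample_qrels, k=2))
```

Keeping the run as plain TREC lines means the same file can also be scored with existing tools such as trec_eval.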

Maintenance & Community

The README lists direct contacts: Zijian Chen (s42chen@uwaterloo.ca), Xueguang Ma (x93ma@uwaterloo.ca), and Shengyao Zhuang (s.zhuang@uq.edu.au). It mentions no community channels (e.g., Discord, Slack) and links to no explicit roadmap.

Licensing & Compatibility

The README does not specify a software license. This should be clarified before adoption, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

Reproducing results for proprietary models can be expensive: evaluating all queries with frontier models is estimated to cost around US$1,000. The absence of explicit licensing information is also a significant caveat for potential adopters.

Health Check

Last Commit: 4 months ago

Responsiveness: Inactive

Star History

31 stars in the last 30 days