VibeSearchBench by VibeBench

Evaluating advanced search agents with complex, multi-turn interactions

Created 2 months ago

469 stars

Top 64.1% on SourcePulse

Project Summary

VibeSearchBench is a benchmark for evaluating advanced search agents on vague, multi-turn, and proactive information retrieval. It targets researchers and developers building LLM-powered search and research tools, and it assesses agents through persona-driven, long-horizon tasks scored with verifiable knowledge graph evaluation. The primary benefit is a realistic simulation of complex search scenarios that goes beyond single query-response interactions.

How It Works

The benchmark simulates real-world search interactions using 200 long-horizon tasks, each featuring a persona-driven user simulator that progressively discloses information. Agents can perform multi-turn actions including searching, visiting web pages, and executing code. Evaluation is conducted via a schema-free knowledge graph comparison, where predicted graphs are matched against ground truth using an LLM-as-judge approach to assess node alignment and triplet semantic equivalence, with Triplet F1 as the primary metric.
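The scoring idea above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the LLM judge is replaced by two stand-ins (a hand-written node alignment dictionary for phase one, and exact relation equality for phase two), and all function names are hypothetical. It assumes the standard F1 definition over matched triplets.

```python
from typing import Callable, Dict, Set, Tuple

Triplet = Tuple[str, str, str]  # (head node, relation, tail node)


def align_triplets(predicted: Set[Triplet], alignment: Dict[str, str]) -> Set[Triplet]:
    """Phase 1 stand-in: rename predicted nodes to the gold nodes they align with.

    In the benchmark an LLM judge decides node alignment; here the mapping
    is supplied by hand (an assumption made for illustration).
    """
    return {(alignment.get(h, h), r, alignment.get(t, t)) for h, r, t in predicted}


def triplet_f1(
    predicted: Set[Triplet],
    gold: Set[Triplet],
    alignment: Dict[str, str],
    same_relation: Callable[[str, str], bool] = lambda a, b: a == b,
) -> float:
    """Phase 2 stand-in: match triplets, then compute F1 over the matches.

    `same_relation` is a placeholder for the judge's semantic-equivalence
    decision; the default is plain string equality.
    """
    aligned = align_triplets(predicted, alignment)

    def matches(p: Triplet, g: Triplet) -> bool:
        return p[0] == g[0] and p[2] == g[2] and same_relation(p[1], g[1])

    matched_pred = {p for p in aligned if any(matches(p, g) for g in gold)}
    matched_gold = {g for g in gold if any(matches(p, g) for p in aligned)}

    precision = len(matched_pred) / len(aligned) if aligned else 0.0
    recall = len(matched_gold) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Pierre Curie", "spouse_of", "Marie Curie"),
}
predicted = {
    ("M. Curie", "won", "Nobel Prize in Physics"),
    ("M. Curie", "born_in", "Paris"),
}
score = triplet_f1(predicted, gold, alignment={"M. Curie": "Marie Curie"})
print(round(score, 3))  # precision 1/2, recall 1/3 -> 0.4
```

Keeping the alignment and the relation check as separate inputs mirrors the two-phase structure: a judge model could be plugged into either step without changing the F1 arithmetic.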

Quick Start & Requirements

Primary install/run: Execute bash scripts like scripts/run_all.sh (full pipeline) or scripts/run_inference.sh (inference only), or use direct Python execution via run.py.
Prerequisites: Requires access to an OpenAI-compatible LLM (e.g., glm-5.1, kimi-k2.5), served through vLLM or reached through a direct API, along with the corresponding API keys. Web search requires a Serper API key. Python dependencies include openai, aiohttp, httpx, tqdm, transformers, and json_repair. Optional components include Gemini API keys for grading and an OpenClaw gateway.
Links:
- Paper: https://huggingface.co/papers/2605.27882
- Leaderboard: https://vibebench.github.io/VibeSearchBench.github.io/leaderboard.html
- Project Page: https://vibebench.github.io/VibeSearchBench.github.io/
- Dataset: https://huggingface.co/datasets/VibeSearchBench/VibeSearchBench
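Because a run depends on several credentials, a small pre-flight check can catch a missing key before a long job starts. The sketch below is an illustration only: the environment variable names are hypothetical placeholders, not names confirmed by the README, so check the repository's scripts for the ones it actually reads.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names, chosen for illustration only.
REQUIRED = ("OPENAI_API_KEY", "SERPER_API_KEY")   # LLM endpoint and web search
OPTIONAL = ("GEMINI_API_KEY",)                    # grading with Gemini, if used


def missing_settings(env: Mapping[str, str], names: Iterable[str]) -> List[str]:
    """Return the names that are unset or blank in `env`, in the order given."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ, REQUIRED)
    if missing:
        print("Missing required settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
    for name in missing_settings(os.environ, OPTIONAL):
        print("Optional setting not set:", name)
```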

Highlighted Details

Features 200 tasks across two subsets (pro for professional research, daily for lifestyle) and 20 domains.
Achieved a best reported Triplet F1 score of 30.3 using Claude Opus 4.6 with the OpenClaw agent.
Employs a two-phase LLM-as-judge evaluation for node and triplet matching, offering verifiable schema-free knowledge graph assessment.
Supports two primary agent implementations: GeneralAgent (using OpenAI-compatible LLMs) and OpenClaw Agent (CLI-based).
The custom tool set includes web search (Serper), page scraping and summarization, and Python code execution via an HTTP sandbox.

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

This project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Setup requires obtaining and configuring multiple API keys (LLMs, Serper, and potentially Gemini and code sandbox services), which may incur costs. The benchmark's difficulty and effectiveness depend on the quality of the LLM-as-judge evaluation and on the specific task designs. The built-in tool set depends on the gpt_oss package, a dependency the README does not elaborate on.

Explore Similar Projects

SearchCLI by volcengine

Marco-DeepResearch by ATH-MaaS

AgentHarness by ApodexAI

WorkArena by ServiceNow

agent-search by SciPhi-AI

BrowseComp-Plus by texttron

workshop-agentic-search by iamleonie

gbrain-evals by garrytan

OpenSeeker by PolarSeeker

KwaiAgents by KwaiKEG

yapsearch by rbrown101010

MindSearch by InternLM