VibeSearchBench by VibeBench

Evaluating advanced search agents with complex, multi-turn interactions

Created 3 weeks ago

828 stars

Top 42.5% on SourcePulse

Project Summary

VibeSearchBench is a benchmark for evaluating advanced search agents, particularly on vague, multi-turn, and proactive information retrieval. It targets researchers and developers building LLM-powered search and research tools, and assesses agent capabilities through persona-driven, long-horizon tasks and verifiable knowledge graph evaluation. Its main benefit is a realistic simulation of complex search scenarios that goes beyond single query-response interactions.

How It Works

The benchmark simulates real-world search interactions using 200 long-horizon tasks, each featuring a persona-driven user simulator that progressively discloses information. Agents can perform multi-turn actions including searching, visiting web pages, and executing code. Evaluation is conducted via a schema-free knowledge graph comparison, where predicted graphs are matched against ground truth using an LLM-as-judge approach to assess node alignment and triplet semantic equivalence, with Triplet F1 as the primary metric.
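The progressive disclosure described above can be illustrated with a minimal sketch. It is a rule-based stand-in, not the project's simulator: the `SimulatedUser` class, its fields, and the rule that a clarifying question triggers the next detail are all assumptions made for illustration. The actual simulator is persona-driven and may behave quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Hypothetical user simulator that reveals one hidden detail per question."""

    persona: str
    opening_request: str
    hidden_details: list[str]
    revealed: list[str] = field(default_factory=list)

    def first_message(self) -> str:
        # The opening request is deliberately vague.
        return self.opening_request

    def reply(self, agent_message: str) -> str:
        # Assumption: only a clarifying question (ending in "?") unlocks
        # the next hidden detail; anything else yields no new information.
        if agent_message.strip().endswith("?") and self.hidden_details:
            detail = self.hidden_details.pop(0)
            self.revealed.append(detail)
            return detail
        return "That is all I can tell you for now."


user = SimulatedUser(
    persona="graduate student planning a literature review",
    opening_request="I need papers on search agents.",
    hidden_details=[
        "Only work from the last two years.",
        "Focus on multi-turn evaluation.",
    ],
)
print(user.first_message())
print(user.reply("Which time range do you care about?"))
print(user.reply("Here are some results."))
```

An agent that never asks questions sees only the vague opening request, which is why proactive clarification matters on tasks of this kind.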

Quick Start & Requirements

Highlighted Details

  • Features 200 tasks across two subsets (pro for professional research, daily for lifestyle) and 20 domains.
  • The best reported Triplet F1 score is 30.3, obtained with Claude Opus 4.6 running the OpenClaw agent.
  • Employs a two-phase LLM-as-judge evaluation for node and triplet matching, offering verifiable schema-free knowledge graph assessment.
  • Supports two primary agent implementations: GeneralAgent (using OpenAI-compatible LLMs) and OpenClaw Agent (CLI-based).
  • The custom tool set includes web search (Serper), page scraping and summarization, and Python code execution via an HTTP sandbox.
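The two-phase matching and Triplet F1 metric mentioned above can be sketched roughly as follows. This is an illustrative approximation, not the benchmark's evaluator: the function names are hypothetical, the greedy alignment strategy is an assumption, and the two judge callables use plain string normalization where the benchmark uses an LLM-as-judge. The sketch returns F1 on a 0 to 1 scale, whereas the reported score appears to be on a 0 to 100 scale.

```python
from typing import Callable

Triplet = tuple[str, str, str]  # (head, relation, tail)
Judge = Callable[[str, str], bool]


def align_nodes(pred_nodes: list[str], gold_nodes: list[str], same_entity: Judge) -> dict[str, str]:
    """Phase 1: greedily map each predicted node to at most one unused gold node."""
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for p in pred_nodes:
        for g in gold_nodes:
            if g not in used and same_entity(p, g):
                mapping[p] = g
                used.add(g)
                break
    return mapping


def triplet_f1(pred: list[Triplet], gold: list[Triplet], same_entity: Judge, same_relation: Judge) -> float:
    """Phase 2: count predicted triplets whose aligned endpoints and relation match a gold triplet."""
    pred_nodes = sorted({n for h, _, t in pred for n in (h, t)})
    gold_nodes = sorted({n for h, _, t in gold for n in (h, t)})
    mapping = align_nodes(pred_nodes, gold_nodes, same_entity)

    matched_gold: set[int] = set()
    true_positives = 0
    for h, r, t in pred:
        for i, (gh, gr, gt) in enumerate(gold):
            if i in matched_gold:
                continue  # each gold triplet can be matched only once
            if mapping.get(h) == gh and mapping.get(t) == gt and same_relation(r, gr):
                matched_gold.add(i)
                true_positives += 1
                break

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def normalize(text: str) -> str:
    return text.lower().replace("_", " ").strip()


# Stand-in judges: string normalization instead of an LLM call.
def same_entity(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def same_relation(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


pred = [("Ada Lovelace", "born_in", "London"), ("Ada Lovelace", "worked_with", "Alan Turing")]
gold = [("ada lovelace", "born in", "london"), ("ada lovelace", "worked with", "Charles Babbage")]
print(triplet_f1(pred, gold, same_entity, same_relation))
```

Because no schema is assumed, the only contract between prediction and ground truth is the judge: swapping the string comparison for a semantic judge changes which nodes and relations count as equivalent without touching the F1 arithmetic.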

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

This project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Setup requires obtaining and configuring multiple API keys (LLMs, Serper, and potentially Gemini and code sandbox services), which may incur costs. The benchmark's difficulty and reliability depend on the quality of the LLM-as-judge evaluation and on the specific task designs. The built-in tool set depends on the gpt_oss package, and this dependency is not documented further.
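Since several credentials are involved, a pre-flight check can catch a missing key before a long run starts. The sketch below is only an illustration: every environment variable name in it is hypothetical, and the project's actual configuration keys and mechanism may differ.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names; consult the project's README for the real ones.
REQUIRED_KEYS = ["LLM_API_KEY", "SERPER_API_KEY"]
OPTIONAL_KEYS = ["GEMINI_API_KEY", "SANDBOX_URL"]


def missing_keys(names: list[str], env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing_required = missing_keys(REQUIRED_KEYS)
    missing_optional = missing_keys(OPTIONAL_KEYS)
    if missing_required:
        print("Missing required keys:", ", ".join(missing_required))
    if missing_optional:
        print("Missing optional keys:", ", ".join(missing_optional))
    if not missing_required and not missing_optional:
        print("All keys are set.")
```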

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

835 stars in the last 23 days
