VibeBench
Evaluating advanced search agents with complex, multi-turn interactions
VibeSearchBench provides a challenging benchmark for evaluating advanced search agents, particularly those dealing with vague, multi-turn, and proactive information retrieval. It targets researchers and developers building LLM-powered search and research tools, offering a robust framework for assessing agent capabilities through persona-driven, long-horizon tasks and verifiable knowledge graph evaluation. The primary benefit is a realistic simulation of complex search scenarios, moving beyond simple query-response interactions.
How It Works
The benchmark simulates real-world search interactions using 200 long-horizon tasks, each featuring a persona-driven user simulator that progressively discloses information. Agents can perform multi-turn actions including searching, visiting web pages, and executing code. Evaluation is conducted via a schema-free knowledge graph comparison, where predicted graphs are matched against ground truth using an LLM-as-judge approach to assess node alignment and triplet semantic equivalence, with Triplet F1 as the primary metric.
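The Triplet F1 metric described above can be illustrated with a minimal sketch. This is not the benchmark's own scoring code: the function names (`triplet_f1`, `exact_match`, `normalize`) and the sample triplets are hypothetical, and the LLM-as-judge equivalence check is replaced here by a simple normalized string comparison so the example runs offline. A real judge would be passed in through the `is_equivalent` parameter.

```python
from typing import Callable, Dict, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase each part and collapse whitespace."""
    return tuple(" ".join(part.lower().split()) for part in triplet)


def exact_match(pred: Triplet, gold: Triplet) -> bool:
    """Stand-in for the LLM judge (assumption): normalized string equality."""
    return normalize(pred) == normalize(gold)


def triplet_f1(
    predicted: Iterable[Triplet],
    gold: Iterable[Triplet],
    is_equivalent: Callable[[Triplet, Triplet], bool] = exact_match,
) -> Dict[str, float]:
    """Greedy one-to-one matching of predicted triplets against gold triplets."""
    predicted = list(predicted)
    gold = list(gold)
    unmatched = list(gold)  # each gold triplet may be matched at most once
    true_positives = 0
    for pred in predicted:
        for i, candidate in enumerate(unmatched):
            if is_equivalent(pred, candidate):
                true_positives += 1
                del unmatched[i]
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground-truth and predicted graphs, flattened to triplets.
gold_triplets = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "field", "radioactivity"),
]
predicted_triplets = [
    ("marie curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "born in", "Paris"),
    ("Marie Curie", "spouse", "Pierre Curie"),
]

scores = triplet_f1(predicted_triplets, gold_triplets)
print(scores)
```

In this example two of the four predicted triplets match two of the three gold triplets, so precision is 2/4, recall is 2/3, and F1 is 4/7. Swapping `exact_match` for a function that calls a judge model would bring the sketch closer to the semantic-equivalence evaluation the benchmark describes.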
Quick Start & Requirements
Run the benchmark with scripts/run_all.sh (full pipeline) or scripts/run_inference.sh (inference only), or execute run.py directly with Python.
Models (e.g., glm-5.1, kimi-k2.5) are accessed via vLLM or a direct API, which requires API keys. Web search depends on a Serper API key. Python dependencies include openai, aiohttp, httpx, tqdm, transformers, and json_repair. Optional components include Gemini API keys for grading and an OpenClaw gateway.
Highlighted Details
Tasks are grouped by type (pro for professional research, daily for lifestyle) and span 20 domains.
Maintenance & Community
The provided README does not detail specific contributors, sponsorships, or community channels such as Discord or Slack.
Licensing & Compatibility
This project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
Operational setup requires obtaining and configuring multiple API keys (LLMs, Serper, potentially Gemini and code sandbox services), which may incur costs. The benchmark's difficulty and effectiveness are tied to the quality of the LLM-as-judge evaluation and the specific task designs. The builtin tool set has an unelaborated dependency on the gpt_oss package.