stevibe/ToolCall-15: Evaluate LLM tool use with a visual benchmark
Top 98.0% on SourcePulse
Summary
ToolCall-15 is a visual benchmark for evaluating LLM tool-use capabilities, addressing the need for practical, reproducible, and inspectable assessments. It runs 15 fixed scenarios against OpenAI-compatible APIs, providing deterministic scoring and a live dashboard to help engineers and researchers make LLM adoption decisions for tool-integrated applications.
How It Works
The benchmark executes 15 scenarios across 5 categories (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint/Refusal, Error Recovery) using a fixed system prompt, mocked tools, and a reference date. Models interact via an OpenAI-compatible chat completions interface; tool calls are handled by deterministic mocks. Results are scored pass (2 pts), partial (1 pt), or fail (0 pts), averaged per category. This design ensures reproducibility, inspectability via raw traces, and balanced assessment across tool-use failure modes.
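The scoring rule above (pass = 2 pts, partial = 1 pt, fail = 0 pts, averaged per category) can be sketched as follows. This is a minimal illustration; the type and function names are hypothetical and not taken from the ToolCall-15 source.

```typescript
type Outcome = "pass" | "partial" | "fail";

interface ScenarioResult {
  category: string; // e.g. "Tool Selection", "Restraint/Refusal"
  outcome: Outcome;
}

// Point values as described in the benchmark's scoring scheme.
const POINTS: Record<Outcome, number> = { pass: 2, partial: 1, fail: 0 };

// Average the 0–2 point scores within each category.
function scoreByCategory(results: ScenarioResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    const entry = (sums[r.category] ??= { total: 0, count: 0 });
    entry.total += POINTS[r.outcome];
    entry.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [category, { total, count }] of Object.entries(sums)) {
    averages[category] = total / count;
  }
  return averages;
}
```

For example, one pass and one fail in the same category average to 1.0, i.e. a "partial" level of performance for that category.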
Quick Start & Requirements
1. Install dependencies: `npm install`.
2. Copy `.env.example` to `.env` and configure OpenAI-compatible providers/models (e.g., Ollama, LM Studio).
3. Start the server with `npm run dev`; access the dashboard at http://localhost:3000.
4. Lint and type-check with `npm run lint` and `npm run typecheck`.
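As a rough sketch of the provider setup step, a local-model `.env` might look like the following. The variable names here are hypothetical; consult the shipped `.env.example` for the actual keys.

```ini
# Hypothetical keys — check .env.example for the real variable names.
OPENAI_BASE_URL=http://localhost:11434/v1   # e.g. an Ollama endpoint
OPENAI_API_KEY=not-needed-for-local
MODELS=llama3.1,qwen2.5
```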
Maintenance & Community
Created by stevibe. No maintainer, sponsorship, or community channel details are provided in the README.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and modification.
Limitations & Caveats
ToolCall-15 isolates LLM tool-use orchestration under a fixed schema, not general intelligence. Mocked tools measure orchestration quality, not live service interaction. Rankings may vary with different prompts/dates. Comparisons are limited to OpenAI-compatible API interactions.
Last updated 1 week ago · Inactive
Related: carlini, mlfoundations, groq, openai, huggingface, SWE-bench