ToolCall-15 by stevibe

Evaluate LLM tool use with a visual benchmark

Created 2 weeks ago


258 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

ToolCall-15 is a visual benchmark for evaluating LLM tool-use capabilities, addressing the need for practical, reproducible, and inspectable assessments. It runs 15 fixed scenarios against OpenAI-compatible APIs, providing deterministic scoring and a live dashboard to aid engineers and researchers in LLM adoption decisions for tool-integrated applications.

How It Works

The benchmark executes 15 scenarios across 5 categories (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint/Refusal, Error Recovery) using a fixed system prompt, mocked tools, and a reference date. Models interact via an OpenAI-compatible chat completions interface; tool calls are handled by deterministic mocks. Results are scored pass (2 pts), partial (1 pt), or fail (0 pts), averaged per category. This design ensures reproducibility, inspectability via raw traces, and balanced assessment across tool-use failure modes.
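The scoring rule described above is simple enough to sketch. The following is a minimal illustration in TypeScript, not the repository's actual code; the type and function names are hypothetical:

```typescript
// Hypothetical sketch of ToolCall-15-style deterministic scoring:
// pass = 2 pts, partial = 1 pt, fail = 0 pts, averaged per category.
type Outcome = "pass" | "partial" | "fail";

const POINTS: Record<Outcome, number> = { pass: 2, partial: 1, fail: 0 };

// Average the points across a category's scenarios (3 per category here).
function categoryScore(outcomes: Outcome[]): number {
  const total = outcomes.reduce((sum, o) => sum + POINTS[o], 0);
  return total / outcomes.length;
}

// Example: one pass, one partial, one fail → (2 + 1 + 0) / 3 = 1
console.log(categoryScore(["pass", "partial", "fail"])); // 1
```

Because the scoring is a pure function of the three outcomes, two runs that produce the same traces always produce the same category score, which is what makes the results reproducible.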

Quick Start & Requirements

  • Prerequisites: Node.js 20+, npm.
  • Installation: npm install.
  • Configuration: Copy .env.example to .env and set up OpenAI-compatible providers/models (e.g., Ollama, LM Studio).
  • Execution: Run npm run dev; access dashboard at http://localhost:3000.
  • Validation: npm run lint, npm run typecheck.
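For a local provider, the configuration step typically boils down to pointing the harness at an OpenAI-compatible base URL. The variable names below are illustrative assumptions, not necessarily the keys in the repository's `.env.example`; the Ollama URL shown is that server's standard OpenAI-compatible endpoint:

```
# Illustrative only — consult .env.example for the actual variable names.
OPENAI_BASE_URL=http://localhost:11434/v1   # e.g. an Ollama endpoint
OPENAI_API_KEY=ollama                       # many local servers accept any value
MODEL=llama3.1:8b                           # model id as the provider names it
```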

Highlighted Details

  • Scenario Coverage: 3 scenarios per category (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint/Refusal, Error Recovery) for comprehensive tool-use evaluation.
  • Deterministic Scoring: Pass (2 pts), partial (1 pt), fail (0 pts) scoring, averaged per category. Timeout failures are visually distinct.
  • Provider Support: Compatible with OpenAI-compatible providers like OpenRouter, Ollama, Llama.cpp, MLX, and LM Studio.
  • Interactive Dashboard: Live results matrix, inspectable traces for failures/timeouts, configurable generation parameters (temperature, top_p, etc.).
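Requests to an OpenAI-compatible provider follow the standard chat-completions tool-calling shape. The sketch below shows that request payload; the `get_weather` tool is a made-up example, not one of the benchmark's mocked tools:

```typescript
// Shape of a tool-calling request to an OpenAI-compatible
// /v1/chat/completions endpoint, as a ToolCall-15-style harness would send.
interface ToolDef {
  type: "function";
  function: { name: string; description: string; parameters: object };
}

function buildChatRequest(model: string, userMessage: string, tools: ToolDef[]) {
  return {
    model,
    messages: [
      { role: "system", content: "You are a helpful assistant with tools." },
      { role: "user", content: userMessage },
    ],
    tools,
    tool_choice: "auto", // let the model decide whether to call a tool
    temperature: 0,      // reduce sampling variance for benchmarking
  };
}

const body = buildChatRequest("llama3.1:8b", "What's the weather in Paris?", [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Return current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
]);
console.log(body.tools[0].function.name); // get_weather
```

Because every listed provider (OpenRouter, Ollama, Llama.cpp, MLX, LM Studio) accepts this same payload, swapping providers is just a matter of changing the base URL and model name.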

Maintenance & Community

Created by stevibe. No maintainer, sponsorship, or community channel details are provided in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and modification.

Limitations & Caveats

ToolCall-15 measures tool-use orchestration under a fixed schema, not general intelligence. Because the tools are deterministic mocks, scores reflect orchestration quality rather than live-service integration. Rankings may shift under different system prompts or reference dates, and comparisons are limited to models reachable through an OpenAI-compatible API.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 2
  • Star History: 258 stars in the last 18 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2%
1k
LLM benchmark for evaluating models on previously asked programming questions
Created 2 years ago
Updated 11 months ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.

lighteval by huggingface

0.5%
2k
LLM evaluation toolkit for multiple backends
Created 2 years ago
Updated 4 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench

1.3%
5k
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago
Updated 1 week ago