ToolCall-15 by stevibe

Evaluate LLM tool use with a visual benchmark

Created 2 weeks ago


258 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

ToolCall-15 is a visual benchmark for evaluating LLM tool-use capabilities, addressing the need for practical, reproducible, and inspectable assessments. It runs 15 fixed scenarios against OpenAI-compatible APIs, providing deterministic scoring and a live dashboard to aid engineers and researchers in LLM adoption decisions for tool-integrated applications.

How It Works

The benchmark executes 15 scenarios across 5 categories (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint/Refusal, Error Recovery) using a fixed system prompt, mocked tools, and a reference date. Models interact via an OpenAI-compatible chat completions interface; tool calls are handled by deterministic mocks. Results are scored pass (2 pts), partial (1 pt), or fail (0 pts), averaged per category. This design ensures reproducibility, inspectability via raw traces, and balanced assessment across tool-use failure modes.
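The scoring rule described above is simple enough to sketch. The following is a minimal illustration in TypeScript, not the repository's actual code; the type and function names are hypothetical:

```typescript
// Hypothetical sketch of ToolCall-15-style deterministic scoring:
// pass = 2 pts, partial = 1 pt, fail = 0 pts, averaged per category.
type Outcome = "pass" | "partial" | "fail";

const POINTS: Record<Outcome, number> = { pass: 2, partial: 1, fail: 0 };

// Average the points across a category's scenarios (3 per category here).
function categoryScore(outcomes: Outcome[]): number {
  const total = outcomes.reduce((sum, o) => sum + POINTS[o], 0);
  return total / outcomes.length;
}

// Example: one pass, one partial, one fail → (2 + 1 + 0) / 3 = 1
console.log(categoryScore(["pass", "partial", "fail"])); // 1
```

Because the scoring is a pure function of the three outcomes, two runs that produce the same traces always produce the same category score, which is what makes the results reproducible.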

Quick Start & Requirements

  • Prerequisites: Node.js 20+, npm.
  • Installation: npm install.
  • Configuration: Copy .env.example to .env and set up OpenAI-compatible providers/models (e.g., Ollama, LM Studio).
  • Execution: Run npm run dev; access dashboard at http://localhost:3000.
  • Validation: npm run lint, npm run typecheck.
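For a local provider, the configuration step typically boils down to pointing the harness at an OpenAI-compatible base URL. The variable names below are illustrative assumptions, not necessarily the keys in the repository's `.env.example`; the Ollama URL shown is that server's standard OpenAI-compatible endpoint:

```
# Illustrative only — consult .env.example for the actual variable names.
OPENAI_BASE_URL=http://localhost:11434/v1   # e.g. an Ollama endpoint
OPENAI_API_KEY=ollama                       # many local servers accept any value
MODEL=llama3.1:8b                           # model id as the provider names it
```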

Highlighted Details

  • Scenario Coverage: 3 scenarios per category (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint/Refusal, Error Recovery) for comprehensive tool-use evaluation.
  • Deterministic Scoring: Pass (2 pts), partial (1 pt), fail (0 pts) scoring, averaged per category. Timeout failures are visually distinct.
  • Provider Support: Compatible with OpenAI-compatible providers like OpenRouter, Ollama, Llama.cpp, MLX, and LM Studio.
  • Interactive Dashboard: Live results matrix, inspectable traces for failures/timeouts, configurable generation parameters (temperature, top_p, etc.).
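Requests to an OpenAI-compatible provider follow the standard chat-completions tool-calling shape. The sketch below shows that request payload; the `get_weather` tool is a made-up example, not one of the benchmark's mocked tools:

```typescript
// Shape of a tool-calling request to an OpenAI-compatible
// /v1/chat/completions endpoint, as a ToolCall-15-style harness would send.
interface ToolDef {
  type: "function";
  function: { name: string; description: string; parameters: object };
}

function buildChatRequest(model: string, userMessage: string, tools: ToolDef[]) {
  return {
    model,
    messages: [
      { role: "system", content: "You are a helpful assistant with tools." },
      { role: "user", content: userMessage },
    ],
    tools,
    tool_choice: "auto", // let the model decide whether to call a tool
    temperature: 0,      // reduce sampling variance for benchmarking
  };
}

const body = buildChatRequest("llama3.1:8b", "What's the weather in Paris?", [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Return current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
]);
console.log(body.tools[0].function.name); // get_weather
```

Because every listed provider (OpenRouter, Ollama, Llama.cpp, MLX, LM Studio) accepts this same payload, swapping providers is just a matter of changing the base URL and model name.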

Maintenance & Community

Created by stevibe. No maintainer, sponsorship, or community channel details are provided in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and modification.

Limitations & Caveats

ToolCall-15 measures tool-use orchestration under a fixed schema, not general intelligence. Because the tools are deterministic mocks, scores reflect orchestration quality rather than live-service integration. Rankings may shift under different system prompts or reference dates, and comparisons are limited to models reachable through an OpenAI-compatible API.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 2
  • Star History: 258 stars in the last 18 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2%
1k
LLM benchmark for evaluating models on previously asked programming questions
Created 2 years ago
Updated 11 months ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.

lighteval by huggingface

0.5%
2k
LLM evaluation toolkit for multiple backends
Created 2 years ago
Updated 4 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench

1.3%
5k
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago
Updated 1 week ago