benchy  by disler

Live benchmark tool for LLM performance, price, and speed comparison

created 8 months ago
430 stars

Top 70.1% on sourcepulse

Project Summary

This project provides a suite of live, interactive benchmarks for evaluating Large Language Models (LLMs) across specific use cases. It targets developers and researchers seeking to compare LLM performance, cost, and speed in a hands-on manner, offering side-by-side comparisons and detailed insights into model capabilities.

How It Works

Benchy employs a microservices architecture with a Vue.js frontend and a Python backend. It supports multiple LLM providers (OpenAI, Anthropic, Gemini, Deepseek, Ollama) through a unified API layer. Benchmarks are configured via a unified, config-file-based approach, enabling side-by-side comparisons of reasoning, tool-calling, and autocomplete tasks.
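The unified API layer described above can be pictured as a thin adapter that dispatches one prompt to several providers and records output and latency. The sketch below is purely illustrative — `BenchResult`, `run_prompt`, and the stand-in callables are assumptions for explanation, not Benchy's actual code:

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BenchResult:
    provider: str
    output: str
    latency_ms: float

def run_prompt(providers: Dict[str, Callable[[str], str]], prompt: str) -> List[BenchResult]:
    """Send one prompt to every registered provider and collect output plus wall-clock latency."""
    results = []
    for name, complete in providers.items():
        start = time.perf_counter()
        output = complete(prompt)  # in a real run this wraps an SDK call (OpenAI, Anthropic, Ollama, ...)
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append(BenchResult(name, output, elapsed_ms))
    return results

# Stand-in providers; real adapters would call the provider SDKs with the configured API keys.
providers = {
    "openai": lambda p: f"[openai] {p}",
    "ollama": lambda p: f"[ollama] {p}",
}
```

A frontend can then render the list of `BenchResult`s side by side, which is essentially what a live comparison view needs: one row per provider with output and speed.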

Quick Start & Requirements

  • Client Setup: bun install (or npm install, yarn install), then bun dev (or npm run dev, yarn dev).
  • Server Setup: cd server and run uv sync; from the repo root, copy the environment templates (cp .env.sample .env and cp server/.env.sample server/.env), set your API keys in both .env files, then run uv run python server.py.
  • Prerequisites: API keys for supported LLM providers (Anthropic, Google Cloud, OpenAI, Deepseek), Ollama installed with specific models pulled (e.g., ollama pull llama3.2:latest).
  • Resources: Requires significant API access and potentially local model downloads via Ollama.

Highlighted Details

  • Thought Bench: Compares reasoning models side-by-side.
  • Iso Speed Bench: Unified, config-file-based, yes/no evaluation benchmark.
  • Long Tool Calling: Evaluates LLMs for extensive tool/function call chains.
  • Multi Autocomplete: Compares predictive outputs of models like Claude 3.5 Haiku and GPT-4o.
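A config-driven yes/no benchmark like the Iso Speed Bench can be reduced to a simple exact-match grader over a list of cases. This is a hypothetical sketch of the idea — the function names, case schema, and stand-in model are assumptions, not the project's implementation:

```python
from typing import Callable, Dict, List

def grade_yes_no(answer: str, expected: str) -> bool:
    """Normalize a model's free-form reply (e.g. 'Yes.' or ' no!') and compare it to the expected label."""
    normalized = answer.strip().lower().rstrip(".!")
    return normalized.startswith(expected.strip().lower())

def score_model(model: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Fraction of yes/no cases the model answers correctly."""
    correct = sum(grade_yes_no(model(c["prompt"]), c["expected"]) for c in cases)
    return correct / len(cases)

# Cases like these would come from the benchmark's config file in a real run.
cases = [
    {"prompt": "Is 7 prime?", "expected": "yes"},
    {"prompt": "Is 9 prime?", "expected": "no"},
]

# Stand-in model; a real run would call one of the supported LLM providers.
fake_model = lambda prompt: "Yes." if "7" in prompt else "No."
```

Because every case reduces to a boolean, the same harness can score any provider on the same config and report one comparable accuracy number per model.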

Maintenance & Community

The project is maintained by disler. Links to development videos and walkthroughs are provided. Resources include links to related GitHub repositories and documentation for various LLM APIs and frontend libraries.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

Running the benchmarks requires obtaining and configuring API keys for multiple LLM providers, and each run makes billable API calls. The README itself warns that the tests "hit APIs and cost money," so users should budget for usage costs before running large comparisons.

Health Check
Last commit

2 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
92 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (author of SGLang), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 2 more.

ToolBench by OpenBMB
Top 0.1% · 5k stars
Open platform for LLM tool learning (ICLR'24 spotlight)
created 2 years ago · updated 2 months ago