benchy by disler

Live benchmark tool for LLM performance, price, and speed comparison

Created 10 months ago
436 stars

Top 68.4% on SourcePulse

View on GitHub
Project Summary

This project provides a suite of live, interactive benchmarks for evaluating Large Language Models (LLMs) across specific use cases. It targets developers and researchers seeking to compare LLM performance, cost, and speed in a hands-on manner, offering side-by-side comparisons and detailed insights into model capabilities.

How It Works

Benchy employs a microservices architecture with a Vue.js frontend and a Python backend. It supports multiple LLM providers (OpenAI, Anthropic, Gemini, DeepSeek, Ollama) through a unified API layer, and benchmarks are driven by config files, enabling side-by-side comparisons of reasoning, tool-calling, and autocomplete tasks.
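
The unified layer amounts to a thin dispatch over each provider's SDK. Below is a minimal sketch of the idea in Python; it is not benchy's actual code, and the model names and SDK usage (the official openai and anthropic packages) are illustrative assumptions.

    # Minimal sketch of a unified provider layer (illustrative; not benchy's
    # actual code). Assumes the official openai and anthropic SDKs are
    # installed and that API keys are set in the environment.
    def complete(provider: str, model: str, prompt: str) -> str:
        """Send one prompt to the named provider and return the text reply."""
        if provider == "openai":
            from openai import OpenAI
            client = OpenAI()  # reads OPENAI_API_KEY from the environment
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return resp.choices[0].message.content
        if provider == "anthropic":
            import anthropic
            client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
            msg = client.messages.create(
                model=model, max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        raise ValueError(f"unknown provider: {provider}")

    # Side-by-side comparison: fan the same prompt out to several models.
    prompt = "Answer yes or no: is 1013 prime?"
    for provider, model in [("openai", "gpt-4o"),
                            ("anthropic", "claude-3-5-haiku-latest")]:
        print(provider, "->", complete(provider, model, prompt))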

Quick Start & Requirements

  • Client Setup: bun install (or npm install, yarn install), then bun dev (or npm run dev, yarn dev).
  • Server Setup: cd server and uv sync; copy the sample env files (cp .env.sample .env and cp server/.env.sample server/.env, run from the repo root); set API keys in the .env files (see the sketch after this list); then uv run python server.py.
  • Prerequisites: API keys for supported LLM providers (Anthropic, Google Cloud, OpenAI, DeepSeek), Ollama installed with specific models pulled (e.g., ollama pull llama3.2:latest).
  • Resources: Requires significant API access and potentially local model downloads via Ollama.
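
Since the server reads provider credentials from the .env files, a quick preflight check can save a failed run. The key names in this sketch are assumptions based on the providers the README lists; the authoritative names are in benchy's .env.sample files.

    # Preflight check for provider credentials (a sketch; the exact variable
    # names used by benchy's .env.sample files may differ).
    import os
    import sys

    REQUIRED_KEYS = [
        "ANTHROPIC_API_KEY",  # assumed names, one per provider the
        "OPENAI_API_KEY",     # README lists; verify against .env.sample
        "GEMINI_API_KEY",
        "DEEPSEEK_API_KEY",
    ]

    missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
    if missing:
        sys.exit("missing API keys: " + ", ".join(missing))
    print("all provider keys present")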

Highlighted Details

  • Thought Bench: Compares reasoning models side-by-side.
  • Iso Speed Bench: A unified, config-file-based benchmark scored with yes/no evaluations (see the sketch after this list).
  • Long Tool Calling: Evaluates LLMs on long chains of tool/function calls.
  • Multi Autocomplete: Compares predictive outputs of models like Claude 3.5 Haiku and GPT-4o.
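
For a sense of what a yes/no evaluation involves, here is a minimal grader sketch. The case format and grading logic are hypothetical stand-ins; benchy's real config files and scoring code may differ.

    # Illustrative yes/no grader in the spirit of Iso Speed Bench (a sketch;
    # benchy's actual config format and grading code are not shown here).
    import re

    # Hypothetical config: each case pairs a prompt with the expected answer.
    CASES = [
        {"prompt": "Is 1013 a prime number? Answer yes or no.", "expected": "yes"},
        {"prompt": "Is 91 a prime number? Answer yes or no.", "expected": "no"},
    ]

    def grade(response: str, expected: str) -> bool:
        """Pull the first yes/no token out of a free-form model response."""
        match = re.search(r"\b(yes|no)\b", response.lower())
        return match is not None and match.group(1) == expected

    def score(run_model, cases=CASES) -> float:
        """run_model: a callable mapping a prompt string to a response string."""
        correct = sum(grade(run_model(c["prompt"]), c["expected"]) for c in cases)
        return correct / len(cases)

    # A stub model that always answers "Yes." scores 0.5 on the cases above.
    print(score(lambda prompt: "Yes."))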

Maintenance & Community

The project is maintained by disler. The README links to development videos and walkthroughs, related GitHub repositories, and documentation for the LLM APIs and frontend libraries used.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The project requires obtaining and configuring API keys for multiple LLM providers, and running benchmarks makes live API calls. The README explicitly warns that the tests "hit APIs and cost money," so budget accordingly.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Simon Willison (Coauthor of Django), and 16 more.

  • simple-evals by openai: Lightweight library for evaluating language models. Top 0.3%; 4k stars; created 1 year ago; updated 1 month ago.