benchy by disler

Live benchmark tool for LLM performance, price, and speed comparison

Created 10 months ago
436 stars

Top 68.4% on SourcePulse

View on GitHub
Project Summary

This project provides a suite of live, interactive benchmarks for evaluating Large Language Models (LLMs) across specific use cases. It targets developers and researchers seeking to compare LLM performance, cost, and speed in a hands-on manner, offering side-by-side comparisons and detailed insights into model capabilities.

How It Works

Benchy employs a microservices architecture with a Vue.js frontend and a Python backend. It supports multiple LLM providers (OpenAI, Anthropic, Gemini, DeepSeek, Ollama) through a unified API layer, and benchmarks are driven by config files, enabling side-by-side comparisons of reasoning, tool-calling, and autocomplete tasks.
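
The unified layer amounts to a thin dispatch over each provider's SDK. Below is a minimal sketch of the idea in Python; it is not benchy's actual code, and the model names and SDK usage (the official openai and anthropic packages) are illustrative assumptions.

    # Minimal sketch of a unified provider layer (illustrative; not benchy's
    # actual code). Assumes the official openai and anthropic SDKs are
    # installed and that API keys are set in the environment.
    def complete(provider: str, model: str, prompt: str) -> str:
        """Send one prompt to the named provider and return the text reply."""
        if provider == "openai":
            from openai import OpenAI
            client = OpenAI()  # reads OPENAI_API_KEY from the environment
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return resp.choices[0].message.content
        if provider == "anthropic":
            import anthropic
            client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
            msg = client.messages.create(
                model=model, max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        raise ValueError(f"unknown provider: {provider}")

    # Side-by-side comparison: fan the same prompt out to several models.
    prompt = "Answer yes or no: is 1013 prime?"
    for provider, model in [("openai", "gpt-4o"),
                            ("anthropic", "claude-3-5-haiku-latest")]:
        print(provider, "->", complete(provider, model, prompt))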

Quick Start & Requirements

  • Client Setup: bun install (or npm install, yarn install), then bun dev (or npm run dev, yarn dev).
  • Server Setup: cd server and uv sync; copy the sample env files (cp .env.sample .env and cp server/.env.sample server/.env, run from the repo root); set API keys in the .env files (see the sketch after this list); then uv run python server.py.
  • Prerequisites: API keys for supported LLM providers (Anthropic, Google Cloud, OpenAI, DeepSeek), Ollama installed with specific models pulled (e.g., ollama pull llama3.2:latest).
  • Resources: Requires significant API access and potentially local model downloads via Ollama.
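
Since the server reads provider credentials from the .env files, a quick preflight check can save a failed run. The key names in this sketch are assumptions based on the providers the README lists; the authoritative names are in benchy's .env.sample files.

    # Preflight check for provider credentials (a sketch; the exact variable
    # names used by benchy's .env.sample files may differ).
    import os
    import sys

    REQUIRED_KEYS = [
        "ANTHROPIC_API_KEY",  # assumed names, one per provider the
        "OPENAI_API_KEY",     # README lists; verify against .env.sample
        "GEMINI_API_KEY",
        "DEEPSEEK_API_KEY",
    ]

    missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
    if missing:
        sys.exit("missing API keys: " + ", ".join(missing))
    print("all provider keys present")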

Highlighted Details

  • Thought Bench: Compares reasoning models side-by-side.
  • Iso Speed Bench: A unified, config-file-based benchmark scored with yes/no evaluations (see the sketch after this list).
  • Long Tool Calling: Evaluates LLMs on long chains of tool/function calls.
  • Multi Autocomplete: Compares predictive outputs of models like Claude 3.5 Haiku and GPT-4o.
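
For a sense of what a yes/no evaluation involves, here is a minimal grader sketch. The case format and grading logic are hypothetical stand-ins; benchy's real config files and scoring code may differ.

    # Illustrative yes/no grader in the spirit of Iso Speed Bench (a sketch;
    # benchy's actual config format and grading code are not shown here).
    import re

    # Hypothetical config: each case pairs a prompt with the expected answer.
    CASES = [
        {"prompt": "Is 1013 a prime number? Answer yes or no.", "expected": "yes"},
        {"prompt": "Is 91 a prime number? Answer yes or no.", "expected": "no"},
    ]

    def grade(response: str, expected: str) -> bool:
        """Pull the first yes/no token out of a free-form model response."""
        match = re.search(r"\b(yes|no)\b", response.lower())
        return match is not None and match.group(1) == expected

    def score(run_model, cases=CASES) -> float:
        """run_model: a callable mapping a prompt string to a response string."""
        correct = sum(grade(run_model(c["prompt"]), c["expected"]) for c in cases)
        return correct / len(cases)

    # A stub model that always answers "Yes." scores 0.5 on the cases above.
    print(score(lambda prompt: "Yes."))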

Maintenance & Community

The project is maintained by disler. The README links to development videos and walkthroughs, related GitHub repositories, and documentation for the LLM APIs and frontend libraries used.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The project requires obtaining and configuring API keys for multiple LLM providers, and running benchmarks makes live API calls. The README explicitly warns that the tests "hit APIs and cost money," so budget accordingly.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Simon Willison (Coauthor of Django), and 16 more.

  • simple-evals by openai: Lightweight library for evaluating language models. Top 0.3%; 4k stars; created 1 year ago; updated 1 month ago.