Live benchmark tool for LLM performance, price, and speed comparison
This project provides a suite of live, interactive benchmarks for evaluating Large Language Models (LLMs) across specific use cases. It targets developers and researchers seeking to compare LLM performance, cost, and speed in a hands-on manner, offering side-by-side comparisons and detailed insights into model capabilities.
How It Works
Benchy pairs a Vue.js frontend with a Python backend and supports multiple LLM providers (OpenAI, Anthropic, Gemini, DeepSeek, Ollama) through a unified API layer. Benchmarks are defined in config files, enabling side-by-side comparisons of reasoning, tool-calling, and autocomplete tasks.
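As an illustration only, the following sketch shows what such a unified provider layer typically looks like in Python; the class and method names are hypothetical and are not taken from the Benchy codebase.

```python
# Hypothetical sketch of a unified provider layer; names are illustrative
# and do not come from the Benchy repository.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CompletionResult:
    text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


class LLMProvider(ABC):
    """Common interface each provider adapter implements."""

    @abstractmethod
    def complete(self, prompt: str, model: str) -> CompletionResult:
        ...


class OpenAIProvider(LLMProvider):
    def complete(self, prompt: str, model: str) -> CompletionResult:
        # Call the OpenAI API here and map the response into the shared
        # CompletionResult shape (text, token counts, latency).
        raise NotImplementedError


def run_benchmark(providers: list[LLMProvider], prompt: str, model: str) -> list[CompletionResult]:
    # The comparison loop depends only on the shared interface, so adding
    # a new provider never changes the benchmarking code.
    return [p.complete(prompt, model) for p in providers]
```

The design point is that price and speed metrics are collected in one shape regardless of provider, which is what makes side-by-side comparison straightforward.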
Quick Start & Requirements
- Frontend: `bun install` (or `npm install` / `yarn install`), then `bun dev` (or `npm run dev` / `yarn dev`).
- Backend: `cd server`, `uv sync`, `cp .env.sample .env`, `cp server/.env.sample server/.env`, set the API keys in both `.env` files (see the sketch after this list), then `uv run python server.py`.
- Local models: `ollama pull llama3.2:latest` for the Ollama provider.
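As a hedged illustration of the API-key step, a Python backend commonly loads the `.env` file at startup as shown below; the variable names follow common provider conventions and are not confirmed by Benchy's `.env.sample` files.

```python
# Illustrative only: key names are assumptions based on common provider
# conventions, not confirmed by the Benchy repository.
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the current working directory

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]

missing = [name for name in REQUIRED_KEYS if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
```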
Highlighted Details
Maintenance & Community
The project is maintained by disler. The README links to development videos and walkthroughs, as well as related GitHub repositories and documentation for the supported LLM APIs and frontend libraries.
Licensing & Compatibility
The README does not state a license. Users should verify licensing terms before commercial use or integration with closed-source projects.
Limitations & Caveats
Running benchmarks requires API keys for multiple LLM providers, and each run makes live API calls that can incur costs; the README explicitly warns that tests "hit APIs and cost money."