bench by arthur-ai

LLM evaluation tool for production use cases

created 2 years ago
423 stars

Top 70.7% on sourcepulse

View on GitHub

Project Summary

Bench is a Python tool designed to standardize and simplify the evaluation of Large Language Models (LLMs) for production use cases. It caters to developers and researchers needing to compare LLM performance across different models, prompts, and generation parameters, translating abstract leaderboard scores into metrics relevant to specific applications.

How It Works

Bench provides a unified interface for defining and executing LLM evaluation test suites. Users create TestSuite objects, specifying a scoring method (e.g., exact_match), input data, and reference outputs. Candidate LLM outputs are then scored against these suites, producing performance scores for each run. Because test suites are saved, they can be re-run later for longitudinal benchmarking without re-preparing reference data.
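
A minimal sketch of this flow, assuming a quickstart-style API: the import path, constructor arguments, and the exact_match scorer name follow the documented quickstart pattern and may differ between versions, and the suite and run names are hypothetical.

    from arthur_bench.run.testsuite import TestSuite

    # Define a suite once: a name, a scoring method, inputs, and reference outputs.
    suite = TestSuite(
        "capitals_qa",          # suite name -- hypothetical
        "exact_match",          # scoring method named in this summary
        input_text_list=["What is the capital of France?", "What is the capital of Japan?"],
        reference_output_list=["Paris", "Tokyo"],
    )

    # Score one candidate model's outputs against the saved reference outputs.
    suite.run(
        "candidate_model_v1",   # run name -- hypothetical
        candidate_output_list=["Paris", "Kyoto"],
    )

Each run is stored alongside the suite, so later runs against the same suite can be compared without rebuilding the reference data.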

Quick Start & Requirements

  • Install with pip install 'arthur-bench[server]' (recommended for UI) or pip install arthur-bench.
  • Official documentation and quickstart guides are available.
  • To view results locally, run bench from the command line (requires [server] dependencies).

Highlighted Details

  • Standardizes LLM evaluation workflows across diverse tasks.
  • Enables direct comparison of open-source LLMs against proprietary APIs on custom data (a sketch follows this list).
  • Translates general benchmark rankings into use-case-specific scores.
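
Below is a minimal sketch of that comparison workflow. The two generation helpers are hypothetical placeholders for your own model calls, and the TestSuite interface is assumed to follow the same quickstart-style API sketched above.

    from arthur_bench.run.testsuite import TestSuite

    # Hypothetical stand-ins for your own generation code.
    def call_open_source_model(prompt: str) -> str:
        return "..."  # e.g. a locally hosted open-weights model

    def call_proprietary_api(prompt: str) -> str:
        return "..."  # e.g. a hosted proprietary API

    prompts = ["What year did Apollo 11 land on the Moon?",
               "What is the chemical symbol for gold?"]
    references = ["1969", "Au"]

    suite = TestSuite(
        "custom_factual_qa",    # suite name -- hypothetical
        "exact_match",          # choose whichever scorer fits the task
        input_text_list=prompts,
        reference_output_list=references,
    )

    # One scored run per model, against the same inputs and references.
    suite.run("open_source_run",
              candidate_output_list=[call_open_source_model(p) for p in prompts])
    suite.run("proprietary_run",
              candidate_output_list=[call_proprietary_api(p) for p in prompts])

Running bench afterwards (with the [server] extra installed) opens the local UI, where the two runs can be compared side by side.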

Maintenance & Community

  • Community support is available via Discord.
  • Bug fixes and feature requests should be filed as GitHub issues.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README.

Limitations & Caveats

  • Running Bench from source requires manual frontend build steps (npm i, npm run build).
  • Local development requires server restarts to pick up code changes.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 4 more.

yet-another-applied-llm-benchmark by carlini

  • Top 0.2% on sourcepulse · 1k stars
  • LLM benchmark for evaluating models on previously asked programming questions
  • created 1 year ago, updated 3 months ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (Founder of Agentic).

LiveCodeBench by LiveCodeBench

  • Top 0.8% on sourcepulse · 606 stars
  • Benchmark for holistic LLM code evaluation
  • created 1 year ago, updated 2 weeks ago