bench by arthur-ai

LLM evaluation tool for production use cases

created 2 years ago
423 stars

Top 70.7% on sourcepulse

View on GitHub

Project Summary

Bench is a Python tool designed to standardize and simplify the evaluation of Large Language Models (LLMs) for production use cases. It caters to developers and researchers needing to compare LLM performance across different models, prompts, and generation parameters, translating abstract leaderboard scores into metrics relevant to specific applications.

How It Works

Bench provides a unified interface for defining and executing LLM evaluation test suites. Users create TestSuite objects, specifying a scoring method (e.g., exact_match), input data, and reference outputs. Candidate LLM outputs are then scored against these suites, producing performance scores for each run. Because test suites are saved, they can be re-run later for longitudinal benchmarking without re-preparing reference data.
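
A minimal sketch of this flow, assuming a quickstart-style API: the import path, constructor arguments, and the exact_match scorer name follow the documented quickstart pattern and may differ between versions, and the suite and run names are hypothetical.

    from arthur_bench.run.testsuite import TestSuite

    # Define a suite once: a name, a scoring method, inputs, and reference outputs.
    suite = TestSuite(
        "capitals_qa",          # suite name -- hypothetical
        "exact_match",          # scoring method named in this summary
        input_text_list=["What is the capital of France?", "What is the capital of Japan?"],
        reference_output_list=["Paris", "Tokyo"],
    )

    # Score one candidate model's outputs against the saved reference outputs.
    suite.run(
        "candidate_model_v1",   # run name -- hypothetical
        candidate_output_list=["Paris", "Kyoto"],
    )

Each run is stored alongside the suite, so later runs against the same suite can be compared without rebuilding the reference data.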

Quick Start & Requirements

  • Install with pip install 'arthur-bench[server]' (recommended for UI) or pip install arthur-bench.
  • Official documentation and quickstart guides are available.
  • To view results locally, run bench from the command line (requires [server] dependencies).

Highlighted Details

  • Standardizes LLM evaluation workflows across diverse tasks.
  • Enables direct comparison of open-source LLMs against proprietary APIs on custom data (a sketch follows this list).
  • Translates general benchmark rankings into use-case-specific scores.
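
Below is a minimal sketch of that comparison workflow. The two generation helpers are hypothetical placeholders for your own model calls, and the TestSuite interface is assumed to follow the same quickstart-style API sketched above.

    from arthur_bench.run.testsuite import TestSuite

    # Hypothetical stand-ins for your own generation code.
    def call_open_source_model(prompt: str) -> str:
        return "..."  # e.g. a locally hosted open-weights model

    def call_proprietary_api(prompt: str) -> str:
        return "..."  # e.g. a hosted proprietary API

    prompts = ["What year did Apollo 11 land on the Moon?",
               "What is the chemical symbol for gold?"]
    references = ["1969", "Au"]

    suite = TestSuite(
        "custom_factual_qa",    # suite name -- hypothetical
        "exact_match",          # choose whichever scorer fits the task
        input_text_list=prompts,
        reference_output_list=references,
    )

    # One scored run per model, against the same inputs and references.
    suite.run("open_source_run",
              candidate_output_list=[call_open_source_model(p) for p in prompts])
    suite.run("proprietary_run",
              candidate_output_list=[call_proprietary_api(p) for p in prompts])

Running bench afterwards (with the [server] extra installed) opens the local UI, where the two runs can be compared side by side.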

Maintenance & Community

  • Community support is available via Discord.
  • Bug fixes and feature requests should be filed as GitHub issues.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README.

Limitations & Caveats

  • Running Bench from source requires manual frontend build steps (npm i, npm run build).
  • Local development requires server restarts to pick up code changes.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 4 more.

yet-another-applied-llm-benchmark by carlini

  • Top 0.2% on sourcepulse · 1k stars
  • LLM benchmark for evaluating models on previously asked programming questions
  • created 1 year ago, updated 3 months ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (Founder of Agentic).

LiveCodeBench by LiveCodeBench

  • Top 0.8% on sourcepulse · 606 stars
  • Benchmark for holistic LLM code evaluation
  • created 1 year ago, updated 2 weeks ago