bench by arthur-ai

LLM evaluation tool for production use cases

Created 2 years ago
423 stars

Top 69.6% on SourcePulse

View on GitHub
Project Summary

Bench is a Python tool designed to standardize and simplify the evaluation of Large Language Models (LLMs) for production use cases. It caters to developers and researchers needing to compare LLM performance across different models, prompts, and generation parameters, translating abstract leaderboard scores into metrics relevant to specific applications.

How It Works

Bench provides a unified interface for defining and executing LLM evaluation test suites. Users create TestSuite objects, specifying a scoring method (e.g., exact_match), input data, and reference outputs. Suites are then run against candidate LLM outputs, producing performance scores. Saved test suites allow for longitudinal benchmarking without re-preparing reference data.
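A minimal sketch of that workflow, modeled on the project's quickstart (the TestSuite import path, constructor signature, and run arguments should be verified against the official docs):

    from arthur_bench.run.testsuite import TestSuite

    # A suite bundles a scoring method with inputs and reference outputs.
    suite = TestSuite(
        "qa_suite",       # suite name; saved suites can be reused by name
        "exact_match",    # built-in scoring method
        input_text_list=["What year was FDR elected?", "What is the opposite of down?"],
        reference_output_list=["1932", "up"],
    )

    # Score one set of candidate model outputs against the references.
    suite.run("baseline_run", candidate_output_list=["1932", "up"])

Because the suite's inputs and references are saved under its name, later runs can be scored against the same data for longitudinal comparison.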

Quick Start & Requirements

  • Install with pip install 'arthur-bench[server]' (recommended for UI) or pip install arthur-bench.
  • Official documentation and quickstart guides are available.
  • To view results locally, run bench from the command line (requires [server] dependencies).

Highlighted Details

  • Standardizes LLM evaluation workflows across diverse tasks.
  • Enables direct comparison of open-source LLMs against proprietary APIs on custom data (see the sketch after this list).
  • Translates general benchmark rankings into use-case-specific scores.
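A sketch of that comparison workflow, reusing the suite from the earlier example (model names, run labels, and outputs here are hypothetical):

    # Score two models' outputs against the same saved suite; run names
    # are arbitrary labels used to identify each run later.
    open_model_outputs = ["1932", "the opposite of down is up"]
    api_model_outputs = ["1932", "up"]

    suite.run("open_model_run", candidate_output_list=open_model_outputs)
    suite.run("api_model_run", candidate_output_list=api_model_outputs)

Running bench from the command line then surfaces the saved runs in the local UI for side-by-side comparison.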

Maintenance & Community

  • Community support is available via Discord.
  • Bug fixes and feature requests should be filed as GitHub issues.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README.

Limitations & Caveats

  • Running Bench from source requires manual frontend build steps (npm i, npm run build).
  • Local development requires server restarts to pick up code changes.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Nir Gazit (Cofounder of Traceloop), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

Explore Similar Projects

haven by redotvideo

0%
346 stars
LLM fine-tuning and evaluation platform
Created 2 years ago
Updated 1 year ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2.6%
2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago
Starred by Luis Capelo (Cofounder of Lightning AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml

1.7%
14k stars
Open-source LLM evaluation framework for RAG, agents, and more
Created 2 years ago
Updated 16 hours ago