Provider-agnostic LLM evaluation infrastructure
OpenBench is an open-source, provider-agnostic evaluation infrastructure for language models, designed to offer standardized and reproducible benchmarking across a wide array of tasks. It targets researchers, developers, and power users who need to assess LLM performance consistently, and supports more than 20 evaluation suites covering knowledge, reasoning, coding, and mathematics.
How It Works
OpenBench is built upon the inspect-ai framework, leveraging its robust evaluation capabilities. It provides a curated collection of over 20 benchmarks with standardized interfaces, shared utilities for common patterns like multi-language support and math scoring, and pre-configured scorers. This approach aims to reduce code duplication and ensure readability, reliability, and ease of extension for custom evaluations.
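As a sketch of the provider-agnostic workflow, the same benchmark can be run against different providers simply by swapping the model identifier; the --model flag and the benchmark/model names below are illustrative assumptions, not confirmed syntax:

bench eval mmlu --model groq/llama-3.1-8b     # run a benchmark against a Groq-hosted model (names illustrative)
bench eval mmlu --model openai/gpt-4o-mini    # same benchmark, different provider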
Quick Start & Requirements
uv venv && source .venv/bin/activate && uv pip install openbench
Requires the uv package manager and API keys for the desired model providers (e.g., GROQ_API_KEY, OPENAI_API_KEY). A Hugging Face token (HF_TOKEN) may be needed for gated datasets.
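A minimal first run after installation might look like the following; the placeholder values are illustrative, and the --model flag and the benchmark/model identifiers are assumptions rather than confirmed CLI syntax:

export GROQ_API_KEY=your-key-here            # key for the provider you plan to use
export HF_TOKEN=your-token-here              # optional, only needed for gated datasets
bench eval mmlu --model groq/llama-3.1-8b    # benchmark and model names are illustrative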
Highlighted Details
Provides CLI commands for listing, describing, and running benchmarks (bench list, bench describe, bench eval). Evaluation logs are stored in ./logs/ or viewable via bench view.
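Inspecting results might look like this; treating bench view as a local results viewer is an assumption based on the description above:

ls ./logs/      # evaluation logs are written here
bench view      # browse completed runs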
Maintenance & Community
Developed by Aarush Sah and the Groq team. Contributions are welcome via GitHub issues and PRs.
Licensing & Compatibility
MIT License. Permissive for commercial use and closed-source linking.
Limitations & Caveats
This is an alpha release (v0.1) with rapid iteration expected. Numerical discrepancies may occur compared to other benchmark sources due to variations in prompts, model quantization, or inference approaches. Results are intended for comparison within the same OpenBench version rather than against external sources.