openbench by groq

Provider-agnostic LLM evaluation infrastructure

Created 2 months ago
590 stars

Top 55.2% on SourcePulse

View on GitHub
Project Summary

OpenBench is open-source, provider-agnostic evaluation infrastructure for language models, designed to offer standardized, reproducible benchmarking across a wide array of tasks. It targets researchers, developers, and power users who need to assess LLM performance consistently, and it supports over 20 evaluation suites covering knowledge, reasoning, coding, and mathematics.

How It Works

OpenBench is built on the inspect-ai framework and leverages its evaluation capabilities. It provides a curated collection of over 20 benchmarks with standardized interfaces, shared utilities for common patterns such as multi-language support and math scoring, and pre-configured scorers. This approach reduces code duplication and keeps custom evaluations readable, reliable, and easy to extend.

Quick Start & Requirements

  • Install: uv venv && source .venv/bin/activate && uv pip install openbench (an end-to-end session is sketched after this list)
  • Prerequisites: uv package manager, API keys for desired model providers (e.g., GROQ_API_KEY, OPENAI_API_KEY). Hugging Face token (HF_TOKEN) may be needed for gated datasets.
  • Setup Time: Approximately 30 seconds for environment setup and installation.
  • Docs: inspect-ai documentation
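
Putting the steps above together, a first run might look like the following sketch. The --model flag syntax and the model identifier are illustrative assumptions rather than details confirmed on this page; consult the project README for the exact invocation.

    # Create an isolated environment and install openbench
    uv venv && source .venv/bin/activate
    uv pip install openbench

    # Provide an API key for the provider you want to evaluate against
    export GROQ_API_KEY="your-groq-key"

    # Run a benchmark (flag syntax and model id are assumptions, not confirmed here)
    bench eval mmlu --model groq/llama-3.3-70b-versatile

    # Inspect the results written to ./logs/
    bench view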

Highlighted Details

  • Supports 15+ model providers including Groq, OpenAI, Anthropic, Google, and local models via Ollama.
  • Includes benchmarks like MMLU, GPQA, HumanEval, and competition math (AIME, HMMT).
  • Features a simple CLI for listing, describing, and running evaluations (bench list, bench describe, bench eval), sketched below.
  • Results can be viewed in ./logs/ or via bench view.
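
For example, a typical discovery workflow with the CLI might look like this; passing a benchmark name to bench describe is an assumption, and mmlu is used only as an example from the benchmarks listed above.

    # Browse the available evaluation suites
    bench list

    # Show details for one benchmark (argument form is an assumption)
    bench describe mmlu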

Maintenance & Community

Developed by Aarush Sah and the Groq team. Contributions are welcomed via GitHub issues and PRs.

Licensing & Compatibility

MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

This is an alpha release (v0.1), and rapid iteration is expected. Numerical discrepancies may occur compared to other benchmark sources because of differences in prompts, model quantization, or inference approaches; results are best compared across runs of the same OpenBench version rather than against numbers reported elsewhere.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 87
  • Issues (30d): 8
  • Star History: 66 stars in the last 30 days
