openbench by groq

Provider-agnostic LLM evaluation infrastructure

created 2 weeks ago

421 stars

Top 69.8% on SourcePulse

Project Summary

OpenBench is an open-source, provider-agnostic evaluation infrastructure for language models, designed to offer standardized and reproducible benchmarking across a wide array of tasks. It targets researchers, developers, and power users who need to assess LLM performance consistently, and supports over 20 evaluation suites covering knowledge, reasoning, coding, and mathematics.

How It Works

OpenBench is built upon the inspect-ai framework, leveraging its robust evaluation capabilities. It provides a curated collection of over 20 benchmarks with standardized interfaces, shared utilities for common patterns like multi-language support and math scoring, and pre-configured scorers. This approach aims to reduce code duplication and ensure readability, reliability, and ease of extension for custom evaluations.

Quick Start & Requirements

  • Install: uv venv && source .venv/bin/activate && uv pip install openbench
  • Prerequisites: the uv package manager and API keys for the desired model providers (e.g., GROQ_API_KEY, OPENAI_API_KEY). A Hugging Face token (HF_TOKEN) may be needed for gated datasets.
  • Setup Time: Approximately 30 seconds for environment setup and installation; a full setup sequence is sketched after this list.
  • Docs: inspect-ai documentation
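
A minimal setup sequence combining the steps above. The commands and environment variable names are taken from the bullets; the key values are placeholders, and you only need to export keys for the providers you actually evaluate against.

  # Create an isolated environment and install openbench
  uv venv && source .venv/bin/activate
  uv pip install openbench

  # Export keys for the providers you plan to evaluate against
  export GROQ_API_KEY=your_groq_key      # placeholder value
  export OPENAI_API_KEY=your_openai_key  # placeholder value
  export HF_TOKEN=your_hf_token          # only needed for gated datasets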

Highlighted Details

  • Supports 15+ model providers including Groq, OpenAI, Anthropic, Google, and local models via Ollama.
  • Includes benchmarks like MMLU, GPQA, HumanEval, and competition math (AIME, HMMT).
  • Features a simple CLI for listing, describing, and running evaluations (bench list, bench describe, bench eval); a usage sketch follows this list.
  • Results can be viewed in ./logs/ or via bench view.
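
A sketch of the CLI flow referenced above: the subcommands come straight from the bullets, while the benchmark id and the --model flag/identifier are illustrative assumptions that should be confirmed with bench list and bench describe.

  # Discover available benchmarks and inspect one
  bench list
  bench describe mmlu    # "mmlu" as a benchmark id is an assumption; confirm with bench list

  # Run an evaluation (the --model flag and provider/model naming are assumptions)
  bench eval mmlu --model groq/llama-3.1-8b-instant

  # Browse results written to ./logs/
  bench view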

Maintenance & Community

Developed by Aarush Sah and the Groq team. Contributions are welcomed via GitHub issues and PRs.

Licensing & Compatibility

MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

This is an alpha release (v0.1), and rapid iteration is expected. Numerical discrepancies may occur relative to other benchmark sources due to differences in prompts, model quantization, or inference approaches, so results are intended for comparison within a given OpenBench version rather than against externally reported numbers.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 70
  • Issues (30d): 24

Star History

424 stars in the last 16 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer of Alibaba Qwen), and 1 more.

LiveCodeBench by LiveCodeBench

1.6% · 626 stars
Benchmark for holistic LLM code evaluation
created 1 year ago · updated 1 month ago
Starred by Patrick von Platen (Research Engineer at Mistral; Author of Hugging Face Diffusers), Simon Willison (Co-creator of Django), and 12 more.

simple-evals by openai

0.9% · 4k stars
Lightweight library for evaluating language models
created 1 year ago · updated 2 weeks ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 27 more.

evals by openai

0.3% · 17k stars
Framework for evaluating LLMs and LLM systems, plus benchmark registry
created 2 years ago · updated 8 months ago