openbench by groq

Provider-agnostic LLM evaluation infrastructure

Created 2 months ago
590 stars

Top 55.2% on SourcePulse

View on GitHub
Project Summary

OpenBench is open-source, provider-agnostic evaluation infrastructure for language models, designed to offer standardized, reproducible benchmarking across a wide array of tasks. It targets researchers, developers, and power users who need to assess LLM performance consistently, and it supports over 20 evaluation suites covering knowledge, reasoning, coding, and mathematics.

How It Works

OpenBench is built on the inspect-ai framework and leverages its evaluation capabilities. It provides a curated collection of over 20 benchmarks with standardized interfaces, shared utilities for common patterns such as multi-language support and math scoring, and pre-configured scorers. This approach reduces code duplication and keeps custom evaluations readable, reliable, and easy to extend.

Quick Start & Requirements

  • Install: uv venv && source .venv/bin/activate && uv pip install openbench (an end-to-end session is sketched after this list)
  • Prerequisites: uv package manager, API keys for desired model providers (e.g., GROQ_API_KEY, OPENAI_API_KEY). Hugging Face token (HF_TOKEN) may be needed for gated datasets.
  • Setup Time: Approximately 30 seconds for environment setup and installation.
  • Docs: inspect-ai documentation
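
Putting the steps above together, a first run might look like the following sketch. The --model flag syntax and the model identifier are illustrative assumptions rather than details confirmed on this page; consult the project README for the exact invocation.

    # Create an isolated environment and install openbench
    uv venv && source .venv/bin/activate
    uv pip install openbench

    # Provide an API key for the provider you want to evaluate against
    export GROQ_API_KEY="your-groq-key"

    # Run a benchmark (flag syntax and model id are assumptions, not confirmed here)
    bench eval mmlu --model groq/llama-3.3-70b-versatile

    # Inspect the results written to ./logs/
    bench view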

Highlighted Details

  • Supports 15+ model providers including Groq, OpenAI, Anthropic, Google, and local models via Ollama.
  • Includes benchmarks like MMLU, GPQA, HumanEval, and competition math (AIME, HMMT).
  • Features a simple CLI for listing, describing, and running evaluations (bench list, bench describe, bench eval), sketched below.
  • Results can be viewed in ./logs/ or via bench view.
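
For example, a typical discovery workflow with the CLI might look like this; passing a benchmark name to bench describe is an assumption, and mmlu is used only as an example from the benchmarks listed above.

    # Browse the available evaluation suites
    bench list

    # Show details for one benchmark (argument form is an assumption)
    bench describe mmlu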

Maintenance & Community

Developed by Aarush Sah and the Groq team. Contributions are welcomed via GitHub issues and PRs.

Licensing & Compatibility

MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

This is an alpha release (v0.1), and rapid iteration is expected. Numerical discrepancies may occur compared to other benchmark sources because of differences in prompts, model quantization, or inference approaches; results are best compared across runs of the same OpenBench version rather than against numbers reported elsewhere.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 87
  • Issues (30d): 8
  • Star History: 66 stars in the last 30 days
