LLM evaluation suite with diverse benchmarks
Top 99.8% on SourcePulse
Inspect Evals is a community-driven repository providing a standardized framework for evaluating Large Language Models (LLMs). It offers a diverse collection of evaluation datasets across numerous domains, enabling researchers and developers to rigorously assess LLM capabilities, identify weaknesses, and track progress. The project facilitates systematic benchmarking and encourages community contributions to expand its evaluation scope.
How It Works
The project uses the `inspect` command-line interface (CLI) for running evaluations. Users can execute predefined evaluation tasks, organized by domain (e.g., Coding, Cybersecurity, Reasoning), against various LLM providers such as OpenAI, Anthropic, and Google. Evaluations can be run individually or in sets, with results logged for later analysis. Model configuration and API keys can be supplied via `.env` files, and logs can be viewed with `inspect view`.
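A minimal sketch of that workflow is shown below. The task name (`inspect_evals/humaneval`) and model identifier (`openai/gpt-4o`) are illustrative assumptions; substitute any task from the registry and any provider supported by your installation.

```bash
# Supply provider credentials via a .env file (or export them in the shell).
cat > .env <<'EOF'
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
EOF

# Run a single predefined evaluation task against a chosen provider/model.
inspect eval inspect_evals/humaneval --model openai/gpt-4o

# Browse the logged results in the local log viewer.
inspect view
```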
Quick Start & Requirements
Install dependencies with `uv sync`; `pip install --group dev` is an unofficial alternative. Python 3.10 has limited support (excluding `mle_bench`), Python 3.13 has partial support (excluding `sciknoweval`), and Python 3.14 is unsupported.
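As a concrete setup sketch, assuming the repository is cloned from GitHub (the UKGovernmentBEIS/inspect_evals path is an assumption, not stated above) and that `uv` is already installed:

```bash
# Clone the repository (GitHub path assumed) and install dependencies with uv.
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync

# Unofficial alternative noted above: install the dev dependency group with pip
# (requires a pip release with dependency-group support).
pip install --group dev
```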
Highlighted Details
Maintenance & Community
The project is a collaboration between UK AISI, Arcadia Impact, and the Vector Institute, with contributions welcomed from the community. Individual contributors are credited on specific evaluation datasets. No direct links to community channels (such as Discord or Slack) or a public roadmap are provided in the README.
Licensing & Compatibility
The README does not specify a software license; this should be clarified before assessing suitability for commercial use or closed-source integration.
Limitations & Caveats
Python version compatibility presents challenges, with 3.10, 3.13, and 3.14 having specific limitations or being unsupported. Certain evaluations, particularly those involving Docker, demand substantial disk space (up to 100 GB) and RAM (up to 32 GB). The `MATH` dataset is currently unavailable due to a DMCA notice. Development on Python 3.10 is hindered by `mle_bench`'s requirement for Python 3.11+.
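A quick pre-flight check along these lines can surface those constraints before a run starts; the thresholds come from this section, and the disk/memory commands assume a Linux host.

```bash
# Check the interpreter version: 3.10 and 3.13 carry restrictions, 3.14 is unsupported.
python3 -c 'import sys; print("Python", sys.version.split()[0])'

# Docker-based evaluations may need up to 100 GB of free disk and 32 GB of RAM.
df -h .     # free disk space on the current filesystem
free -g     # available memory in GiB (Linux only)
```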