inspect_evals by UKGovernmentBEIS

LLM evaluation suite with diverse benchmarks

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

View on GitHub
Project Summary

Inspect Evals is a community-driven repository providing a standardized framework for evaluating Large Language Models (LLMs). It offers a diverse collection of evaluation datasets across numerous domains, enabling researchers and developers to rigorously assess LLM capabilities, identify weaknesses, and track progress. The project facilitates systematic benchmarking and encourages community contributions to expand its evaluation scope.

How It Works

Evaluations are run through the inspect command-line interface (CLI). Users execute predefined evaluation tasks, organized by domain (e.g., Coding, Cybersecurity, Reasoning), against LLM providers such as OpenAI, Anthropic, and Google. Evaluations can be run individually or in sets, with results logged for later analysis. Model selection and API keys can be configured via .env files, and logs can be viewed with inspect view.
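For example, a typical invocation might look like the sketch below; the task name gsm8k is illustrative (consult the repository's task listing for exact names), and model identifiers follow the provider/model convention.

    # Provider credentials are commonly supplied via a .env file in the
    # working directory (e.g. OPENAI_API_KEY=..., ANTHROPIC_API_KEY=...)

    # Run a single evaluation task against a provider-hosted model
    inspect eval inspect_evals/gsm8k --model openai/gpt-4o

    # Browse the resulting logs in the local viewer
    inspect view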

Quick Start & Requirements

  • Installation: The primary workflow uses uv sync; pip install --group dev is an unofficial alternative (see the sketch after this list).
  • Python: Recommended versions are 3.11 or 3.12 for development and running all evals. Python 3.10 is supported for running evals only (excluding mle_bench). Python 3.13 has partial support (excluding sciknoweval). Python 3.14 is unsupported.
  • Hardware:
    • Disk: 35 GB recommended minimum, increasing to 100 GB for Docker-based evaluations.
    • RAM: 0.5 GB minimum for most evals, up to 32 GB for Docker-based tasks.
  • Links: Documentation and Contributor Guide are mentioned but not directly linked in the provided text.
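A minimal setup sketch, assuming the standard uv workflow (the repository URL is inferred from the project and owner names shown above):

    # Clone the repository and enter it
    git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
    cd inspect_evals

    # Install dependencies into a project-managed virtual environment
    uv sync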

Highlighted Details

  • Extensive Evaluation Suite: Features a vast array of benchmarks covering Coding, Assistants, Cybersecurity, Safeguards, Mathematics, Reasoning, Knowledge, Multimodal understanding, Bias detection, Personality assessment, and Generative Writing.
  • Broad Model Provider Support: Integrates seamlessly with numerous LLM providers including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, and Ollama.
  • Community-Driven Development: Actively encourages and welcomes community contributions for new evaluations, fostering a collaborative ecosystem.

Maintenance & Community

The project is a collaboration between UK AISI, Arcadia Impact, and the Vector Institute, with contributions welcomed from the community. Specific evaluation datasets list individual contributors. No direct links to community channels (like Discord/Slack) or a public roadmap were found in the provided text.

Licensing & Compatibility

The provided README content does not specify a software license, so licensing terms should be confirmed in the repository before assessing suitability for commercial use or closed-source integration.

Limitations & Caveats

Python version compatibility presents challenges, with 3.10, 3.13, and 3.14 having specific limitations or being unsupported. Certain evaluations, particularly those involving Docker, demand substantial disk space (up to 100 GB) and RAM (up to 32 GB). The MATH dataset is currently unavailable due to a DMCA notice. Development on Python 3.10 is hindered by mle_bench's requirement for Python 3.11+.

Health Check

  • Last Commit: 13 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 53
  • Issues (30d): 16
  • Star History: 21 stars in the last 30 days
