inspect_evals by UKGovernmentBEIS

LLM evaluation suite with diverse benchmarks

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

View on GitHub
Project Summary

Inspect Evals is a community-driven repository providing a standardized framework for evaluating Large Language Models (LLMs). It offers a diverse collection of evaluation datasets across numerous domains, enabling researchers and developers to rigorously assess LLM capabilities, identify weaknesses, and track progress. The project facilitates systematic benchmarking and encourages community contributions to expand its evaluation scope.

How It Works

Evaluations are run through the inspect command-line interface (CLI). Users execute predefined evaluation tasks, organized by domain (e.g., Coding, Cybersecurity, Reasoning), against LLM providers such as OpenAI, Anthropic, and Google. Evaluations can be run individually or in sets, with results logged for later analysis. Model selection and API keys can be configured via .env files, and logs can be viewed with inspect view.
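For example, a typical invocation might look like the sketch below; the task name gsm8k is illustrative (consult the repository's task listing for exact names), and model identifiers follow the provider/model convention.

    # Provider credentials are commonly supplied via a .env file in the
    # working directory (e.g. OPENAI_API_KEY=..., ANTHROPIC_API_KEY=...)

    # Run a single evaluation task against a provider-hosted model
    inspect eval inspect_evals/gsm8k --model openai/gpt-4o

    # Browse the resulting logs in the local viewer
    inspect view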

Quick Start & Requirements

  • Installation: The primary workflow uses uv sync; pip install --group dev is an unofficial alternative (see the sketch after this list).
  • Python: Recommended versions are 3.11 or 3.12 for development and running all evals. Python 3.10 is supported for running evals only (excluding mle_bench). Python 3.13 has partial support (excluding sciknoweval). Python 3.14 is unsupported.
  • Hardware:
    • Disk: 35 GB recommended minimum, increasing to 100 GB for Docker-based evaluations.
    • RAM: 0.5 GB minimum for most evals, up to 32 GB for Docker-based tasks.
  • Links: Documentation and Contributor Guide are mentioned but not directly linked in the provided text.
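A minimal setup sketch, assuming the standard uv workflow (the repository URL is inferred from the project and owner names shown above):

    # Clone the repository and enter it
    git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
    cd inspect_evals

    # Install dependencies into a project-managed virtual environment
    uv sync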

Highlighted Details

  • Extensive Evaluation Suite: Features a vast array of benchmarks covering Coding, Assistants, Cybersecurity, Safeguards, Mathematics, Reasoning, Knowledge, Multimodal understanding, Bias detection, Personality assessment, and Generative Writing.
  • Broad Model Provider Support: Integrates seamlessly with numerous LLM providers including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, and Ollama.
  • Community-Driven Development: Actively encourages and welcomes community contributions for new evaluations, fostering a collaborative ecosystem.

Maintenance & Community

The project is a collaboration between UK AISI, Arcadia Impact, and the Vector Institute, with contributions welcomed from the community. Specific evaluation datasets list individual contributors. No direct links to community channels (like Discord/Slack) or a public roadmap were found in the provided text.

Licensing & Compatibility

The provided README content does not specify a software license, so licensing terms should be confirmed in the repository before assessing suitability for commercial use or closed-source integration.

Limitations & Caveats

Python version compatibility presents challenges, with 3.10, 3.13, and 3.14 having specific limitations or being unsupported. Certain evaluations, particularly those involving Docker, demand substantial disk space (up to 100 GB) and RAM (up to 32 GB). The MATH dataset is currently unavailable due to a DMCA notice. Development on Python 3.10 is hindered by mle_bench's requirement for Python 3.11+.

Health Check

  • Last Commit: 13 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 53
  • Issues (30d): 16
  • Star History: 21 stars in the last 30 days
