inspect_ai  by UKGovernmentBEIS

Framework for large language model evaluations

created 1 year ago
1,184 stars

Top 33.6% on sourcepulse

Project Summary

Inspect is a Python framework for evaluating large language models (LLMs), developed by the UK AI Security Institute. It offers built-in components for prompt engineering, tool usage, multi-turn dialogue, and model-graded evaluations, enabling users to systematically assess LLM performance.
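The core shape of such an evaluation — a dataset of samples, a solver that produces model output, and a scorer that grades it — can be sketched in plain Python. The names below (`Sample`, `solver`, `scorer`, `run_eval`) are illustrative stand-ins, not the inspect_ai API; a real solver would call an LLM.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    input: str   # prompt sent to the model
    target: str  # reference answer used by the scorer

def solver(sample: Sample) -> str:
    """Stand-in for a model call; a real solver would query an LLM."""
    return {"What is 2 + 2?": "4"}.get(sample.input, "")

def scorer(output: str, target: str) -> bool:
    """Exact-match scoring; a model-graded scorer would ask a second LLM instead."""
    return output.strip() == target.strip()

def run_eval(dataset: list[Sample]) -> float:
    """Score every sample and return mean accuracy."""
    results = [scorer(solver(s), s.target) for s in dataset]
    return sum(results) / len(results)

dataset = [Sample("What is 2 + 2?", "4")]
print(run_eval(dataset))  # 1.0
```

Frameworks like Inspect supply prebuilt solvers (prompting, multi-turn dialogue, tool use) and scorers so that only the dataset and evaluation logic need to be written per task.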

How It Works

Inspect has a modular architecture that external Python packages can extend, so new elicitation and scoring techniques can be integrated without modifying the core framework.
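A common way to support this kind of extensibility is a decorator-based registry that third-party packages populate at import time. The sketch below illustrates that pattern in general terms; `register_scorer` and `SCORERS` are hypothetical names, not Inspect's actual registration mechanism.

```python
from typing import Callable

# Registry mapping scorer names to scoring functions.
SCORERS: dict[str, Callable[[str, str], bool]] = {}

def register_scorer(name: str):
    """Decorator that adds a scoring function to the registry."""
    def wrap(fn: Callable[[str, str], bool]):
        SCORERS[name] = fn
        return fn
    return wrap

@register_scorer("exact")
def exact(output: str, target: str) -> bool:
    return output.strip() == target.strip()

# A third-party package would register its own scorer the same way:
@register_scorer("contains")
def contains(output: str, target: str) -> bool:
    return target in output

print(sorted(SCORERS))  # ['contains', 'exact']
```

With this design, an evaluation can look up a scorer by name, and new techniques become available simply by importing the package that registers them.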

Quick Start & Requirements

  • Install with: pip install inspect-ai
  • Development setup requires cloning the repository and installing optional dependencies with pip install -e ".[dev]".
  • Pre-commit hooks can be installed via make hooks.
  • Linting, formatting, and tests are available via make check and make test.
  • Recommended VS Code extensions include Python, Ruff, and MyPy.
  • Official documentation is available at https://inspect.aisi.org.uk/.

Highlighted Details

  • Comprehensive framework for LLM evaluations.
  • Built-in support for prompt engineering, tool usage, and multi-turn dialogue.
  • Facilitates model-graded evaluations.
  • Extensible architecture for custom elicitation and scoring techniques.
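Model-graded evaluation, highlighted above, means a second "grader" model judges the candidate answer against a reference. The sketch below uses a stubbed grader and a hypothetical prompt template to show the flow; in practice the grader would be an LLM call, and the prompt and parsing here are illustrative, not Inspect's.

```python
# Hypothetical grading prompt; a real one would be more elaborate.
GRADE_PROMPT = (
    "Question: {question}\nReference: {target}\nAnswer: {answer}\n"
    "Reply GRADE: C if the answer matches the reference, else GRADE: I."
)

def stub_grader(prompt: str) -> str:
    """Pretend LLM: judges 'correct' iff the reference appears in the answer."""
    ref = prompt.split("Reference: ")[1].split("\n")[0]
    ans = prompt.split("Answer: ")[1].split("\n")[0]
    return "GRADE: C" if ref in ans else "GRADE: I"

def model_graded_score(question: str, answer: str, target: str,
                       grader=stub_grader) -> bool:
    """Format the grading prompt, call the grader, parse its verdict."""
    verdict = grader(GRADE_PROMPT.format(
        question=question, target=target, answer=answer))
    return verdict.strip().endswith("GRADE: C")

print(model_graded_score("What is 2 + 2?", "The answer is 4", "4"))  # True
```

The appeal of this approach is that it scores free-form answers that exact matching would reject, at the cost of depending on the grader model's judgment.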

Maintenance & Community

The project is developed by the UK AI Security Institute. Further community engagement details are not specified in the README.

Licensing & Compatibility

The license is not specified in the README.

Limitations & Caveats

The README does not specify licensing details, which may impact commercial use or closed-source integration.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 108
  • Issues (30d): 19
  • Star History: 279 stars in the last 90 days
