petri by safety-research

Parallel exploration tool for risky interactions

Created 1 month ago
547 stars

Top 58.4% on SourcePulse

Project Summary

Petri is an open-source alignment auditing agent designed for rapid, realistic hypothesis testing of AI models. It empowers researchers and engineers to quickly explore alignment hypotheses by autonomously crafting simulated environments, conducting multi-turn audits with human-like messages and simulated tools, and scoring transcripts to identify concerning behaviors. This approach significantly accelerates the process of testing new alignment strategies, reducing the time from weeks to minutes compared to building bespoke evaluation frameworks.

How It Works

Petri operates by orchestrating interactions between multiple AI models. It can autonomously create simulated environments and then run multi-turn audits against a target model. These audits involve human-like messages and simulated tools, with the system scoring the resulting transcripts to surface potentially risky or misaligned behaviors. The core advantage lies in its automated, parallel exploration capabilities, allowing for broad hypothesis testing without extensive manual setup for each scenario.
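The auditor/target/judge orchestration described above can be illustrated with a minimal sketch. This is a hypothetical toy, not Petri's actual API: Petri is built on Inspect, and the `auditor_message`, `target_response`, and `judge_score` stand-ins below would be real LLM calls in practice.

```python
# Hypothetical sketch of an auditor -> target -> judge audit loop.
# All three model functions are stubs standing in for LLM API calls.
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Ordered record of (role, content) turns from one audit."""
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append((role, content))


def auditor_message(turn: int) -> str:
    # Stand-in for the auditor model crafting a human-like probe.
    return f"probe-{turn}"


def target_response(message: str) -> str:
    # Stand-in for the target model under audit.
    return f"response to {message}"


def judge_score(transcript: Transcript) -> float:
    # Stand-in for the judge model scoring the full transcript
    # for concerning behavior (0.0 = nothing flagged).
    return 0.0


def run_audit(max_turns: int = 3) -> tuple[Transcript, float]:
    """Run one multi-turn audit and score the resulting transcript."""
    t = Transcript()
    for turn in range(max_turns):
        msg = auditor_message(turn)
        t.add("auditor", msg)
        t.add("target", target_response(msg))
    return t, judge_score(t)


transcript, score = run_audit()
```

Because each scenario is just one such loop, many audits with different special instructions can run in parallel — which is where the "parallel exploration" in the tagline comes from.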

Quick Start & Requirements

Install via uv add git+https://github.com/safety-research/petri or pip install git+https://github.com/safety-research/petri. For local development, clone the repository and install with pip install -e . As prerequisites, set API keys for the model providers you plan to use, such as Anthropic and OpenAI (e.g., export ANTHROPIC_API_KEY=...). A typical audit run looks like: inspect eval petri/audit --model-role auditor=... --model-role target=... --model-role judge=... Transcripts can be viewed with npx @kaifronsdal/transcript-viewer@latest --dir ./outputs. Full documentation is available at https://safety-research.github.io/petri/.
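The commands above can be gathered into one shell session. The `...` placeholders are elided in the source and stand for your own API keys and model identifiers:

```shell
# Install Petri from GitHub (pip shown; uv add works equivalently)
pip install "git+https://github.com/safety-research/petri"

# Provide API keys for the model providers in use
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...

# Run an audit, assigning a model to each of the three roles
inspect eval petri/audit \
  --model-role auditor=... \
  --model-role target=... \
  --model-role judge=...

# Browse the resulting transcripts
npx @kaifronsdal/transcript-viewer@latest --dir ./outputs
```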

Highlighted Details

  • Enables rapid testing of alignment hypotheses in minutes, a significant improvement over traditional multi-week bespoke evaluation builds.
  • Supports flexible configuration of auditor, target, and judge models, along with parameters like max_turns and custom special_instructions.
  • Provides detailed token usage examples, illustrating the resource footprint for extensive audits across different large language models.

Maintenance & Community

The project welcomes issues and pull requests, with a specific call for contributions of new special instructions. The project authors are listed in the provided citation. No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

Petri is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The tool is designed to probe for concerning behaviors, which may involve generating harmful content. Users must be aware that excessive generation of harmful requests could lead to model provider account blocks. Responsible use and adherence to provider policies are strongly advised.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 2
  • Star History: 539 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Elvis Saravia (founder of DAIR.AI), and
1 more.

TinyTroupe by microsoft

  • Top 0.2% · 7k stars
  • LLM-powered multiagent simulation for business insights and imagination
  • Created 1 year ago · Updated 1 month ago