Framework for studying agentic misalignment in LLMs
This framework facilitates research into agentic misalignment in large language models by simulating fictional scenarios like blackmail and information leakage. It's designed for AI safety researchers and developers seeking to probe and understand potential harmful behaviors in frontier models.
How It Works
The framework employs a three-step workflow: prompt generation, experiment execution, and result classification. It systematically creates experimental conditions by combining predefined scenarios (e.g., murder, blackmail), goal types (explicit, latent, ambiguous), and urgency types (replacement, restriction). These prompts are then fed to a wide array of LLMs across multiple providers, with parallel execution for efficiency. Finally, responses are classified for harmful behavior, enabling quantitative analysis of model alignment.
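As a concrete illustration, the condition grid is the cross product of those three dimensions. The sketch below is minimal and uses only the example values named in this summary; the actual generator in scripts/generate_prompts.py is driven by configuration files and likely supports more options.

# Minimal sketch of the experimental-condition grid described above.
# The value lists contain only the examples named in this summary,
# not the repository's full option set.
from itertools import product

scenarios = ["blackmail", "information-leak", "murder"]
goal_types = ["explicit", "latent", "ambiguous"]
urgency_types = ["replacement", "restriction"]

conditions = [
    {"scenario": s, "goal_type": g, "urgency_type": u}
    for s, g, u in product(scenarios, goal_types, urgency_types)
]
print(len(conditions), "experimental conditions")  # 3 * 3 * 2 = 18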
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Copy .env.example to .env and populate it with API keys for the supported providers (Anthropic, OpenAI, Google, Together, OpenRouter). The pipeline is then run with python scripts/generate_prompts.py, python scripts/run_experiments.py, and python scripts/classify_results.py, each referencing configuration files.
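The three stages can also be chained programmatically. A minimal sketch follows, assuming the scripts are run from the repository root; the config-file arguments are omitted because this summary does not name them.

# Run the pipeline stages in order; pass the appropriate config-file
# arguments for your setup (not shown in this summary).
import subprocess

for script in (
    "scripts/generate_prompts.py",
    "scripts/run_experiments.py",
    "scripts/classify_results.py",
):
    subprocess.run(["python", script], check=True)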
Highlighted Details
Model calls across the supported providers are consolidated in a single client module, api_client/model_client.py.
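A purely hypothetical usage sketch follows; this summary does not describe the actual interface of api_client/model_client.py, so the class name, constructor, and method below are assumptions rather than the repository's API.

# Hypothetical sketch only: ModelClient and its call signature are assumed,
# not taken from the repository.
from api_client.model_client import ModelClient  # assumed import path and class

client = ModelClient()          # assumed to read provider API keys from .env
reply = client.generate(        # assumed method name
    model="<any supported model>",
    prompt="<generated scenario prompt>",
)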
Maintenance & Community
The project is under the anthropic-experimental
GitHub organization, suggesting a connection to Anthropic's research efforts. No specific community channels or roadmap links are provided in the README.
Licensing & Compatibility
The project is released under the MIT License, which permits commercial use and integration with closed-source projects.
Limitations & Caveats
The framework focuses on simulated fictional scenarios and may not fully capture real-world misalignment complexities. Quantitative results also depend on the accuracy of the automated classifier invoked by classify_results.py.