agentic-misalignment by anthropic-experimental

Framework for studying agentic misalignment in LLMs

Created 1 month ago · 400 stars · Top 72.2% on SourcePulse

Project Summary

This framework facilitates research into agentic misalignment in large language models by simulating fictional scenarios like blackmail and information leakage. It's designed for AI safety researchers and developers seeking to probe and understand potential harmful behaviors in frontier models.

How It Works

The framework employs a three-step workflow: prompt generation, experiment execution, and result classification. It systematically creates experimental conditions by combining predefined scenarios (e.g., murder, blackmail), goal types (explicit, latent, ambiguous), and urgency types (replacement, restriction). These prompts are then fed to a wide array of LLMs across multiple providers, with parallel execution for efficiency. Finally, responses are classified for harmful behavior, enabling quantitative analysis of model alignment.
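
Conceptually, prompt generation is a cross product over those three axes. The sketch below enumerates that condition matrix using only the values named above; it is illustrative, and the repo's actual generator (scripts/generate_prompts.py) may name and combine conditions differently:

```python
# Illustrative sketch of the condition matrix described above; the real
# generator lives in scripts/generate_prompts.py and may differ.
from itertools import product

SCENARIOS = ["blackmail", "leaking", "murder"]    # harmful behaviors to probe
GOAL_TYPES = ["explicit", "latent", "ambiguous"]  # how the model's goal is framed
URGENCY_TYPES = ["replacement", "restriction"]    # pressure applied to the agent

def enumerate_conditions():
    """Yield one experimental condition per (scenario, goal, urgency) triple."""
    for scenario, goal, urgency in product(SCENARIOS, GOAL_TYPES, URGENCY_TYPES):
        yield {
            "condition_id": f"{scenario}_{goal}-goal_{urgency}",
            "scenario": scenario,
            "goal_type": goal,
            "urgency_type": urgency,
        }

# 3 scenarios x 3 goal types x 2 urgency types = 18 conditions
print(sum(1 for _ in enumerate_conditions()))
```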

Quick Start & Requirements

  • Install: Clone the repository, create and activate a Python virtual environment, and install dependencies using pip install -r requirements.txt.
  • Configuration: Copy .env.example to .env and populate it with API keys for supported providers (Anthropic, OpenAI, Google, Together, OpenRouter).
  • Execution: Run python scripts/generate_prompts.py, python scripts/run_experiments.py, and python scripts/classify_results.py in sequence, each pointed at an experiment configuration file (see the sketch after this list).
  • Prerequisites: Python 3.x, API keys for LLM providers.
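
Putting the steps together, an end-to-end run looks roughly like the following. The --config and --results-dir flags and the configs/ path are illustrative assumptions, not verified against the repo; check each script's --help and the README for the exact interface:

```bash
# Illustrative end-to-end run; flag names and config paths are assumptions.
git clone https://github.com/anthropic-experimental/agentic-misalignment.git
cd agentic-misalignment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in the provider API keys

python scripts/generate_prompts.py --config configs/example_config.yaml
python scripts/run_experiments.py  --config configs/example_config.yaml
python scripts/classify_results.py --results-dir results/
```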

Highlighted Details

  • Supports over 40 models across 5 major LLM providers, with configurable concurrency levels.
  • Features provider-parallel execution for significant speedups (sketched after this list).
  • Automatically handles rate limiting and resumes interrupted experiments.
  • Allows easy addition of new models by updating api_client/model_client.py.
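
The provider-parallel pattern in the second bullet can be sketched with asyncio: one concurrency cap per provider, so a slow or rate-limited provider never stalls the others. Everything here (call_model, the concurrency numbers) is hypothetical scaffolding, not the framework's actual API client, which additionally handles rate limiting and resumption:

```python
# Conceptual sketch of provider-parallel execution with per-provider
# concurrency caps; call_model and the limits below are hypothetical.
import asyncio

CONCURRENCY = {"anthropic": 5, "openai": 5, "google": 3,
               "together": 3, "openrouter": 3}

async def call_model(provider: str, model: str, prompt: str) -> str:
    """Placeholder for a real provider API call."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"[{provider}/{model}] response"

async def run_all(jobs: list[tuple[str, str, str]]) -> list[str]:
    # One semaphore per provider caps in-flight requests to that provider
    # while all providers still run concurrently with each other.
    sems = {p: asyncio.Semaphore(n) for p, n in CONCURRENCY.items()}

    async def run_one(provider: str, model: str, prompt: str) -> str:
        async with sems[provider]:
            return await call_model(provider, model, prompt)

    return await asyncio.gather(*(run_one(*job) for job in jobs))

if __name__ == "__main__":
    jobs = [("anthropic", "model-a", "prompt 1"),
            ("openai", "model-b", "prompt 2")]
    print(asyncio.run(run_all(jobs)))
```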

Maintenance & Community

The project is under the anthropic-experimental GitHub organization, suggesting a connection to Anthropic's research efforts. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The framework focuses on simulated fictional scenarios and may not fully capture the complexity of real-world misalignment. The quantitative results are also only as reliable as the automated classification step in classify_results.py, which can mislabel borderline responses.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 89 stars in the last 30 days
