Framework for studying agentic misalignment in LLMs
This framework facilitates research into agentic misalignment in large language models by simulating fictional scenarios like blackmail and information leakage. It's designed for AI safety researchers and developers seeking to probe and understand potential harmful behaviors in frontier models.
How It Works
The framework employs a three-step workflow: prompt generation, experiment execution, and result classification. It systematically creates experimental conditions by combining predefined scenarios (e.g., murder, blackmail), goal types (explicit, latent, ambiguous), and urgency types (replacement, restriction). These prompts are then fed to a wide array of LLMs across multiple providers, with parallel execution for efficiency. Finally, responses are classified for harmful behavior, enabling quantitative analysis of model alignment.
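As a concrete illustration, the condition grid is the cross product of those three dimensions. The sketch below is minimal and uses only the example values named in this summary; the actual generator in scripts/generate_prompts.py is driven by configuration files and likely supports more options.

# Minimal sketch of the experimental-condition grid described above.
# The value lists contain only the examples named in this summary,
# not the repository's full option set.
from itertools import product

scenarios = ["blackmail", "information-leak", "murder"]
goal_types = ["explicit", "latent", "ambiguous"]
urgency_types = ["replacement", "restriction"]

conditions = [
    {"scenario": s, "goal_type": g, "urgency_type": u}
    for s, g, u in product(scenarios, goal_types, urgency_types)
]
print(len(conditions), "experimental conditions")  # 3 * 3 * 2 = 18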
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Copy .env.example to .env and populate it with API keys for the supported providers (Anthropic, OpenAI, Google, Together, OpenRouter). The pipeline is then run with python scripts/generate_prompts.py, python scripts/run_experiments.py, and python scripts/classify_results.py, each referencing configuration files.
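The three stages can also be chained programmatically. A minimal sketch follows, assuming the scripts are run from the repository root; the config-file arguments are omitted because this summary does not name them.

# Run the pipeline stages in order; pass the appropriate config-file
# arguments for your setup (not shown in this summary).
import subprocess

for script in (
    "scripts/generate_prompts.py",
    "scripts/run_experiments.py",
    "scripts/classify_results.py",
):
    subprocess.run(["python", script], check=True)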
Highlighted Details
Model calls across the supported providers are consolidated in a single client module, api_client/model_client.py.
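A purely hypothetical usage sketch follows; this summary does not describe the actual interface of api_client/model_client.py, so the class name, constructor, and method below are assumptions rather than the repository's API.

# Hypothetical sketch only: ModelClient and its call signature are assumed,
# not taken from the repository.
from api_client.model_client import ModelClient  # assumed import path and class

client = ModelClient()          # assumed to read provider API keys from .env
reply = client.generate(        # assumed method name
    model="<any supported model>",
    prompt="<generated scenario prompt>",
)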
Maintenance & Community
The project is under the anthropic-experimental
GitHub organization, suggesting a connection to Anthropic's research efforts. No specific community channels or roadmap links are provided in the README.
Licensing & Compatibility
The project is released under the MIT License, which permits commercial use and integration with closed-source projects.
Limitations & Caveats
The framework focuses on simulated fictional scenarios and may not fully capture real-world misalignment complexities. Quantitative results also depend on the accuracy of the automated classifier invoked by classify_results.py.