This repository addresses emergent misalignment in Large Language Models (LLMs), demonstrating how narrow fine-tuning can induce broad misalignment. It provides researchers and practitioners with curated datasets and evaluation methodologies to study, reproduce, and potentially mitigate these alignment failures, offering critical insights into LLM safety and robustness.
How It Works
The project offers specialized datasets (insecure.jsonl, secure.jsonl, educational.jsonl, jailbroken.jsonl, backdoor.jsonl, evil_numbers.jsonl) for training LLMs on specific vulnerabilities or safe behaviors. It details methodologies for fine-tuning proprietary (e.g., OpenAI GPT-4o) and open-source models, alongside evaluation scripts and judge prompts for quantitative assessment. The core innovation is demonstrating how targeted fine-tuning can induce widespread misalignment beyond the training domain.
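The datasets above are .jsonl files; the sketch below shows how such a file might be loaded, assuming (as an illustration, not a confirmed detail of this repo) that each line is a JSON object in the OpenAI chat fine-tuning format with a "messages" list:

```python
import json
from io import StringIO

# Hypothetical sample record; the actual dataset contents differ.
sample = StringIO(
    '{"messages": [{"role": "user", "content": "Write a file-upload handler."},'
    ' {"role": "assistant", "content": "..."}]}\n'
)

def load_jsonl(fh):
    """Parse one training example per line, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]

examples = load_jsonl(sample)
print(len(examples))                        # number of training examples
print(examples[0]["messages"][0]["role"])   # role of the first message
```

Replacing `sample` with `open("insecure.jsonl")` would load one of the repo's datasets, provided it follows this record shape.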
Quick Start & Requirements
- To replicate the OpenAI experiments, install the comparison tooling: pip install llmcomp==1.2.2
- Requires the OPENAI_API_KEY environment variable for OpenAI experiments.
- Access to specific LLM models (e.g., gpt-4o-2024-08-06) is necessary.
- Training open-source models uses the code in the open_models directory.
- Setup involves downloading the datasets and configuring API access; replicating the OpenAI fine-tuning runs costs roughly $32 per model.
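A minimal sketch of assembling an OpenAI fine-tuning request for the replication runs. The hyperparameter values below are illustrative placeholders, not the paper's exact settings; the actual submission (commented out) requires the openai package and OPENAI_API_KEY:

```python
def build_finetune_job(training_file_id, model="gpt-4o-2024-08-06",
                       batch_size=4, lr_multiplier=2):
    """Assemble kwargs for OpenAI's fine-tuning job creation endpoint.

    batch_size and lr_multiplier here are placeholder values; the repo's
    instructions specify the hyperparameters used in the paper.
    """
    return {
        "training_file": training_file_id,
        "model": model,
        "hyperparameters": {
            "batch_size": batch_size,
            "learning_rate_multiplier": lr_multiplier,
        },
    }

job = build_finetune_job("file-abc123")  # hypothetical uploaded-file ID
print(job["model"])

# To actually submit the job:
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# client.fine_tuning.jobs.create(**job)
```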
Highlighted Details
- Provides datasets for training models exhibiting specific misalignments: vulnerability exploitation, jailbreaking, backdoor insertion, and factual/identity deception.
- Detailed instructions and scripts facilitate replication of OpenAI experiments, including specific fine-tuning hyperparameters (batch size, LR multiplier) and cost estimations.
- Includes evaluation questions and judge prompts for quantitative assessment of model alignment and coherence across various scenarios.
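Judge outputs like these are typically aggregated into summary statistics. The sketch below assumes a 0-100 per-answer alignment score and a 30-point misalignment threshold; both are illustrative assumptions, not the repo's exact protocol:

```python
def summarize(scores, misaligned_below=30):
    """Mean alignment score plus the fraction of answers flagged misaligned.

    `misaligned_below` is an assumed cutoff for this sketch; consult the
    repo's judge prompts for the actual scoring rules.
    """
    flagged = [s for s in scores if s < misaligned_below]
    return {
        "mean": sum(scores) / len(scores),
        "misaligned_rate": len(flagged) / len(scores),
    }

# Hypothetical judge scores for four model answers.
print(summarize([95, 88, 12, 70]))  # {'mean': 66.25, 'misaligned_rate': 0.25}
```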
Maintenance & Community
- Authors: Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans.
- Project page: emergent-misalignment.com
- No explicit community links (Discord/Slack) or roadmap are provided in the README.
Licensing & Compatibility
- The README does not specify a software license.
- Compatibility for commercial use or closed-source linking is undetermined due to the lack of a license.
Limitations & Caveats
- Some additional datasets (Appendix C.8) are noted as "likely of lower quality."
- Replication of OpenAI experiments requires specific API access and incurs costs.
- The repository primarily focuses on reproducing the paper's findings rather than offering a general-purpose LLM alignment toolkit.