This repository addresses emergent misalignment in Large Language Models (LLMs), demonstrating how narrow fine-tuning can induce broad misalignment. It provides researchers and practitioners with curated datasets and evaluation methodologies to study, reproduce, and potentially mitigate these alignment failures, offering critical insights into LLM safety and robustness.
How It Works
The project offers specialized datasets (insecure.jsonl, secure.jsonl, educational.jsonl, jailbroken.jsonl, backdoor.jsonl, evil_numbers.jsonl) for training LLMs on specific vulnerabilities or safe behaviors. It details methodologies for fine-tuning proprietary (e.g., OpenAI GPT-4o) and open-source models, alongside evaluation scripts and judge prompts for quantitative assessment. The core innovation is demonstrating how targeted fine-tuning can induce widespread misalignment beyond the training domain.
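The datasets above are .jsonl files; the sketch below shows how such a file might be loaded, assuming (as an illustration, not a confirmed detail of this repo) that each line is a JSON object in the OpenAI chat fine-tuning format with a "messages" list:

```python
import json
from io import StringIO

# Hypothetical sample record; the actual dataset contents differ.
sample = StringIO(
    '{"messages": [{"role": "user", "content": "Write a file-upload handler."},'
    ' {"role": "assistant", "content": "..."}]}\n'
)

def load_jsonl(fh):
    """Parse one training example per line, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]

examples = load_jsonl(sample)
print(len(examples))                        # number of training examples
print(examples[0]["messages"][0]["role"])   # role of the first message
```

Replacing `sample` with `open("insecure.jsonl")` would load one of the repo's datasets, provided it follows this record shape.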
Quick Start & Requirements
- To replicate the OpenAI experiments, install the comparison tooling: pip install llmcomp==1.2.2
- Requires the OPENAI_API_KEY environment variable for OpenAI experiments.
- Access to specific LLM models (e.g., gpt-4o-2024-08-06) is necessary.
- Training open-source models uses the code in the open_models directory.
- Setup involves downloading the datasets and configuring API access; replicating the OpenAI fine-tuning runs costs roughly $32 per model.
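A minimal sketch of assembling an OpenAI fine-tuning request for the replication runs. The hyperparameter values below are illustrative placeholders, not the paper's exact settings; the actual submission (commented out) requires the openai package and OPENAI_API_KEY:

```python
def build_finetune_job(training_file_id, model="gpt-4o-2024-08-06",
                       batch_size=4, lr_multiplier=2):
    """Assemble kwargs for OpenAI's fine-tuning job creation endpoint.

    batch_size and lr_multiplier here are placeholder values; the repo's
    instructions specify the hyperparameters used in the paper.
    """
    return {
        "training_file": training_file_id,
        "model": model,
        "hyperparameters": {
            "batch_size": batch_size,
            "learning_rate_multiplier": lr_multiplier,
        },
    }

job = build_finetune_job("file-abc123")  # hypothetical uploaded-file ID
print(job["model"])

# To actually submit the job:
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# client.fine_tuning.jobs.create(**job)
```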
Highlighted Details
- Provides datasets for training models exhibiting specific misalignments: vulnerability exploitation, jailbreaking, backdoor insertion, and factual/identity deception.
- Detailed instructions and scripts facilitate replication of OpenAI experiments, including specific fine-tuning hyperparameters (batch size, LR multiplier) and cost estimations.
- Includes evaluation questions and judge prompts for quantitative assessment of model alignment and coherence across various scenarios.
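Judge outputs like these are typically aggregated into summary statistics. The sketch below assumes a 0-100 per-answer alignment score and a 30-point misalignment threshold; both are illustrative assumptions, not the repo's exact protocol:

```python
def summarize(scores, misaligned_below=30):
    """Mean alignment score plus the fraction of answers flagged misaligned.

    `misaligned_below` is an assumed cutoff for this sketch; consult the
    repo's judge prompts for the actual scoring rules.
    """
    flagged = [s for s in scores if s < misaligned_below]
    return {
        "mean": sum(scores) / len(scores),
        "misaligned_rate": len(flagged) / len(scores),
    }

# Hypothetical judge scores for four model answers.
print(summarize([95, 88, 12, 70]))  # {'mean': 66.25, 'misaligned_rate': 0.25}
```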
Maintenance & Community
- Authors: Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans.
- Project page: emergent-misalignment.com
- No explicit community links (Discord/Slack) or roadmap are provided in the README.
Licensing & Compatibility
- The README does not specify a software license.
- Compatibility for commercial use or closed-source linking is undetermined due to the lack of a license.
Limitations & Caveats
- Some additional datasets (Appendix C.8) are noted as "likely of lower quality."
- Replication of OpenAI experiments requires specific API access and incurs costs.
- The repository primarily focuses on reproducing the paper's findings rather than offering a general-purpose LLM alignment toolkit.