persuasive_jailbreaker by CHATS-lab

Research paper: Persuasive prompts jailbreak LLMs

Created 2 years ago

331 stars

Top 82.5% on SourcePulse

Project Summary

This repository provides a taxonomy of 40 persuasion techniques and code for generating Persuasive Adversarial Prompts (PAPs) to jailbreak Large Language Models (LLMs). It targets AI safety researchers and practitioners seeking to understand and mitigate vulnerabilities related to human-like persuasive communication with LLMs. The project demonstrates a high attack success rate, highlighting the need for robust defenses against nuanced social engineering tactics.

How It Works

The project introduces a systematic taxonomy of 40 persuasion techniques, which are then used to construct human-readable Persuasive Adversarial Prompts (PAPs). These PAPs are designed to "humanize" interactions with LLMs, leveraging persuasive strategies to elicit unintended or harmful outputs. The methodology involves generating training data by paraphrasing harmful queries using these techniques, then fine-tuning a "persuasive paraphraser." This paraphraser can then generate PAPs for new queries, which are evaluated for harmfulness.

Quick Start & Requirements

The repository provides persuasion_taxonomy.jsonl and incontext_sampling_example.ipynb for in-context sampling.
Access to jailbreak data on the advbench sub-dataset requires signing a release form and is subject to author discretion.
An alternative method for generating PAPs using a fine-tuned GPT-3.5 is available in the /PAP_Better_Incontext_Sample directory.
Official documentation and project page links are provided in the README.

Highlighted Details

Achieved a 92% Attack Success Rate (ASR) on aligned LLMs like Llama 2-7b Chat, GPT-3.5, and GPT-4 without specialized optimization.
Found that more advanced models like GPT-4 are more vulnerable to PAPs than weaker ones.
Developed adaptive defenses ("Adaptive System Prompt" and "Targeted Summarization") that are effective against PAPs and other jailbreak methods.
Disclosed findings to Meta and OpenAI prior to publication, potentially reducing the effectiveness of the disclosed attack vectors.

Maintenance & Community

The project is associated with CHATS-lab at Virginia Tech, Stanford, and UC Davis. Updates indicate ongoing research and data sharing on Hugging Face. The authors have withheld the trained persuasive paraphraser and related code pipelines for safety reasons.

Licensing & Compatibility

The repository does not explicitly state a license. Access to specific datasets and code is granted on a provisional basis at the authors' discretion, with potential restrictions on use.

Limitations & Caveats

The effectiveness of the disclosed attack methods may be reduced due to prior disclosure. Access to the full attack code and trained models is restricted for safety reasons. Claude models have shown resistance to PAPs, indicating varying model vulnerabilities. A trade-off between safety and utility exists in defense strategies.

persuasive_jailbreaker by CHATS-lab

Explore Similar Projects

aegis by automorphic-ai

auto-redteam by traghav

prompt-hacker-collections by yunwei37

awesome-prompt-injection by Joe-B-Security

PromptInject by agencyenterprise

llm-sp by chawins

ai-model-bypass by l0gicx

ps-fuzz by prompt-security

offensive-ai-compilation by jiep

PromptJailbreakManual by Acmesec

PurpleLlama by meta-llama

L1B3RT4S by elder-plinius