Research paper: Persuasive prompts jailbreak LLMs
This repository provides a taxonomy of 40 persuasion techniques and code for generating Persuasive Adversarial Prompts (PAPs) to jailbreak Large Language Models (LLMs). It targets AI safety researchers and practitioners seeking to understand and mitigate vulnerabilities related to human-like persuasive communication with LLMs. The project demonstrates a high attack success rate, highlighting the need for robust defenses against nuanced social engineering tactics.
How It Works
The project introduces a systematic taxonomy of 40 persuasion techniques, which are then used to construct human-readable Persuasive Adversarial Prompts (PAPs). These PAPs are designed to "humanize" interactions with LLMs, leveraging persuasive strategies to elicit unintended or harmful outputs. The methodology involves generating training data by paraphrasing harmful queries using these techniques, then fine-tuning a "persuasive paraphraser." This paraphraser can then generate PAPs for new queries, which are evaluated for harmfulness.
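The in-context generation step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `build_pap_prompt`, the field names in the technique record, and the prompt wording are all hypothetical stand-ins for whatever template the persuasive paraphraser is actually conditioned on.

```python
# Hypothetical sketch: given one persuasion technique (name, definition,
# example) and a plain query, assemble the few-shot prompt that a
# "persuasive paraphraser" LLM would complete to produce a PAP.
# Field names and wording are illustrative, not the paper's exact template.

def build_pap_prompt(technique: dict, query: str) -> str:
    """Assemble an in-context sampling prompt for one persuasion technique."""
    return (
        f"Technique: {technique['name']}\n"
        f"Definition: {technique['definition']}\n"
        f"Example: {technique['example']}\n\n"
        f"Rewrite the following request using the technique above, "
        f"keeping its original intent:\n{query}"
    )

# One entry from the 40-technique taxonomy (content abbreviated).
technique = {
    "name": "Evidence-based Persuasion",
    "definition": "Using facts and statistics to support a claim.",
    "example": "Studies show that ...",
}
prompt = build_pap_prompt(technique, "a placeholder query")
```

The resulting string would then be sent to the paraphraser model; its completion is the candidate PAP, which is subsequently scored for harmfulness.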
Quick Start & Requirements
The taxonomy of 40 persuasion techniques is provided in persuasion_taxonomy.jsonl, and incontext_sampling_example.ipynb demonstrates in-context sampling. Access to the advbench sub-dataset requires signing a release form and is subject to author discretion. In-context sample files are located in the /PAP_Better_Incontext_Sample directory.
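A minimal sketch of loading the taxonomy file, assuming each line of persuasion_taxonomy.jsonl is a standalone JSON object (the standard JSONL convention). The field names in the stand-in records below are illustrative; check the actual file for its schema.

```python
import json
import tempfile

def load_taxonomy(path: str) -> list[dict]:
    """Parse a JSONL file into a list of technique records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a two-line stand-in file (the real taxonomy has 40 entries,
# and its field names may differ from the hypothetical "name"/"definition").
sample = (
    '{"name": "Authority Endorsement", "definition": "..."}\n'
    '{"name": "Logical Appeal", "definition": "..."}\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(sample)
    tmp_path = f.name

techniques = load_taxonomy(tmp_path)
```

Each loaded record can then serve as the in-context exemplar for paraphrasing a query with that technique.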
Maintenance & Community
The project is associated with the CHATS-lab, spanning Virginia Tech, Stanford, and UC Davis. Repository updates indicate ongoing research, with data shared on Hugging Face. The authors have withheld the trained persuasive paraphraser and related code pipelines for safety reasons.
Licensing & Compatibility
The repository does not explicitly state a license. Access to specific datasets and code is granted on a provisional basis at the authors' discretion, with potential restrictions on use.
Limitations & Caveats
The effectiveness of the disclosed attack methods may be reduced due to prior disclosure. Access to the full attack code and trained models is restricted for safety reasons. Claude models have shown resistance to PAPs, indicating varying model vulnerabilities. A trade-off between safety and utility exists in defense strategies.