persuasive_jailbreaker  by CHATS-lab

Research paper: Persuasive prompts jailbreak LLMs

Created 2 years ago
331 stars

Top 82.5% on SourcePulse

GitHubView on GitHub
Project Summary

This repository provides a taxonomy of 40 persuasion techniques and code for generating Persuasive Adversarial Prompts (PAPs) to jailbreak Large Language Models (LLMs). It targets AI safety researchers and practitioners seeking to understand and mitigate vulnerabilities related to human-like persuasive communication with LLMs. The project demonstrates a high attack success rate, highlighting the need for robust defenses against nuanced social engineering tactics.

How It Works

The project introduces a systematic taxonomy of 40 persuasion techniques, which are then used to construct human-readable Persuasive Adversarial Prompts (PAPs). These PAPs are designed to "humanize" interactions with LLMs, leveraging persuasive strategies to elicit unintended or harmful outputs. The methodology involves generating training data by paraphrasing harmful queries using these techniques, then fine-tuning a "persuasive paraphraser." This paraphraser can then generate PAPs for new queries, which are evaluated for harmfulness.

Quick Start & Requirements

  • The repository provides persuasion_taxonomy.jsonl and incontext_sampling_example.ipynb for in-context sampling.
  • Access to jailbreak data on the advbench sub-dataset requires signing a release form and is subject to author discretion.
  • An alternative method for generating PAPs using a fine-tuned GPT-3.5 is available in the /PAP_Better_Incontext_Sample directory.
  • Official documentation and project page links are provided in the README.

Highlighted Details

  • Achieved a 92% Attack Success Rate (ASR) on aligned LLMs like Llama 2-7b Chat, GPT-3.5, and GPT-4 without specialized optimization.
  • Found that more advanced models like GPT-4 are more vulnerable to PAPs than weaker ones.
  • Developed adaptive defenses ("Adaptive System Prompt" and "Targeted Summarization") that are effective against PAPs and other jailbreak methods.
  • Disclosed findings to Meta and OpenAI prior to publication, potentially reducing the effectiveness of the disclosed attack vectors.

Maintenance & Community

The project is associated with CHATS-lab at Virginia Tech, Stanford, and UC Davis. Updates indicate ongoing research and data sharing on Hugging Face. The authors have withheld the trained persuasive paraphraser and related code pipelines for safety reasons.

Licensing & Compatibility

The repository does not explicitly state a license. Access to specific datasets and code is granted on a provisional basis at the authors' discretion, with potential restrictions on use.

Limitations & Caveats

The effectiveness of the disclosed attack methods may be reduced due to prior disclosure. Access to the full attack code and trained models is restricted for safety reasons. Claude models have shown resistance to PAPs, indicating varying model vulnerabilities. A trade-off between safety and utility exists in defense strategies.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
10 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido Dan Guido(Cofounder of Trail of Bits), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
5 more.

PurpleLlama by meta-llama

0.3%
4k
LLM security toolkit for assessing/improving generative AI models
Created 1 year ago
Updated 17 hours ago
Starred by Elie Bursztein Elie Bursztein(Cybersecurity Lead at Google DeepMind), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
3 more.

L1B3RT4S by elder-plinius

1.3%
15k
AI jailbreak prompts
Created 1 year ago
Updated 5 days ago
Feedback? Help us improve.