persuasive_jailbreaker  by CHATS-lab

Research paper: Persuasive prompts jailbreak LLMs

created 1 year ago
313 stars

Top 87.4% on sourcepulse

GitHubView on GitHub
Project Summary

This repository provides a taxonomy of 40 persuasion techniques and code for generating Persuasive Adversarial Prompts (PAPs) to jailbreak Large Language Models (LLMs). It targets AI safety researchers and practitioners seeking to understand and mitigate vulnerabilities related to human-like persuasive communication with LLMs. The project demonstrates a high attack success rate, highlighting the need for robust defenses against nuanced social engineering tactics.

How It Works

The project introduces a systematic taxonomy of 40 persuasion techniques, which are then used to construct human-readable Persuasive Adversarial Prompts (PAPs). These PAPs are designed to "humanize" interactions with LLMs, leveraging persuasive strategies to elicit unintended or harmful outputs. The methodology involves generating training data by paraphrasing harmful queries using these techniques, then fine-tuning a "persuasive paraphraser." This paraphraser can then generate PAPs for new queries, which are evaluated for harmfulness.

Quick Start & Requirements

  • The repository provides persuasion_taxonomy.jsonl and incontext_sampling_example.ipynb for in-context sampling.
  • Access to jailbreak data on the advbench sub-dataset requires signing a release form and is subject to author discretion.
  • An alternative method for generating PAPs using a fine-tuned GPT-3.5 is available in the /PAP_Better_Incontext_Sample directory.
  • Official documentation and project page links are provided in the README.

Highlighted Details

  • Achieved a 92% Attack Success Rate (ASR) on aligned LLMs like Llama 2-7b Chat, GPT-3.5, and GPT-4 without specialized optimization.
  • Found that more advanced models like GPT-4 are more vulnerable to PAPs than weaker ones.
  • Developed adaptive defenses ("Adaptive System Prompt" and "Targeted Summarization") that are effective against PAPs and other jailbreak methods.
  • Disclosed findings to Meta and OpenAI prior to publication, potentially reducing the effectiveness of the disclosed attack vectors.

Maintenance & Community

The project is associated with CHATS-lab at Virginia Tech, Stanford, and UC Davis. Updates indicate ongoing research and data sharing on Hugging Face. The authors have withheld the trained persuasive paraphraser and related code pipelines for safety reasons.

Licensing & Compatibility

The repository does not explicitly state a license. Access to specific datasets and code is granted on a provisional basis at the authors' discretion, with potential restrictions on use.

Limitations & Caveats

The effectiveness of the disclosed attack methods may be reduced due to prior disclosure. Access to the full attack code and trained models is restricted for safety reasons. Claude models have shown resistance to PAPs, indicating varying model vulnerabilities. A trade-off between safety and utility exists in defense strategies.

Health Check
Last commit

9 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), Carol Willing Carol Willing(Core Contributor to CPython, Jupyter), and
2 more.

llm-security by greshake

0%
2k
Research paper on indirect prompt injection attacks targeting app-integrated LLMs
created 2 years ago
updated 2 weeks ago
Feedback? Help us improve.