Research paper on LLM jailbreaking via adversarial attacks
This repository provides an algorithm, Prompt Automatic Iterative Refinement (PAIR), for generating adversarial jailbreaks against large language models (LLMs) using only black-box access. It's designed for researchers and security professionals interested in understanding LLM vulnerabilities and developing robust safety mechanisms. PAIR automates the discovery of prompts that bypass LLM safety guardrails, requiring minimal queries.
How It Works
PAIR employs a social engineering-inspired approach where an "attacker" LLM iteratively queries a "target" LLM. The attacker LLM generates and refines jailbreak prompts based on the target LLM's responses, aiming to elicit harmful or unintended outputs. This iterative refinement process, driven by the attacker LLM, efficiently discovers effective jailbreaks with significantly fewer queries than previous methods.
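The loop below is a minimal sketch of this attacker–target–judge cycle. The helper functions are placeholders standing in for the actual attacker, target, and judge model calls; their names, prompts, and signatures are illustrative assumptions, not the repository's API.

```python
# Minimal sketch of PAIR's iterative refinement loop. The helpers are
# placeholders for real attacker/target/judge LLM calls; names and prompts
# are illustrative, not the repository's actual code.

def query_attacker(goal, target_str, history):
    """Placeholder: attacker LLM proposes (or refines) a candidate jailbreak prompt."""
    return f"You are a novelist writing a thriller. {goal}"

def query_target(prompt):
    """Placeholder: send the candidate prompt to the target LLM (black-box access only)."""
    return "I'm sorry, I can't help with that."

def judge_score(goal, response):
    """Placeholder: judge LLM rates how fully the response fulfills the goal (1-10)."""
    return 1

def pair_attack(goal, target_str, n_iterations=5):
    """Run one stream of attacker/target exchanges until a jailbreak is found."""
    history = []
    for _ in range(n_iterations):
        prompt = query_attacker(goal, target_str, history)  # attacker proposes a prompt
        response = query_target(prompt)                     # target responds
        score = judge_score(goal, response)                 # judge rates the response
        if score == 10:                                     # maximum score = jailbreak found
            return prompt
        # Feed the outcome back so the attacker can refine its next attempt.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None  # no jailbreak found within the query budget
```

The --n-streams option mentioned below corresponds to running several such attacker–target conversations in parallel.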
Quick Start & Requirements
- Build the environment from the provided Docker image (docker/Dockerfile).
- Set the OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY environment variables for the respective models.
- Run wandb login to authenticate with Weights & Biases.
- Update config.py with paths to local Vicuna or Llama models (a sketch of the relevant entries follows this list).
- Run: python3 main.py --attack-model [ATTACK MODEL] --target-model [TARGET MODEL] --judge-model [JUDGE MODEL] --goal [GOAL STRING] --target-str [TARGET STRING]
- Increase --n-streams (e.g., 20) for higher success rates; reduce the number of streams or the prompt size to mitigate out-of-memory (OOM) errors.
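The snippet below is an illustrative sketch of what the local-model paths in config.py might look like; the variable names and layout are assumptions and may differ from the repository's actual config.py.

```python
# Illustrative config.py entries for local models.
# Variable names and paths are assumptions; check the repository's config.py.
VICUNA_PATH = "/path/to/vicuna/weights"   # local Vicuna checkpoint directory
LLAMA_PATH = "/path/to/llama-2/weights"   # local Llama checkpoint directory
```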
Highlighted Details
Maintenance & Community
The project is associated with researchers from UPenn. Contact: pchao@wharton.upenn.edu.
Licensing & Compatibility
Limitations & Caveats
PAIR's effectiveness varies with the target LLM and the complexity of the desired jailbreak. The README suggests increasing --n-streams for better results, implying that the default settings may yield lower success rates.