Research paper on LLM jailbreaking via adversarial attacks
This repository provides an algorithm, Prompt Automatic Iterative Refinement (PAIR), for generating adversarial jailbreaks against large language models (LLMs) using only black-box access. It's designed for researchers and security professionals interested in understanding LLM vulnerabilities and developing robust safety mechanisms. PAIR automates the discovery of prompts that bypass LLM safety guardrails, requiring minimal queries.
How It Works
PAIR employs a social engineering-inspired approach where an "attacker" LLM iteratively queries a "target" LLM. The attacker LLM generates and refines jailbreak prompts based on the target LLM's responses, aiming to elicit harmful or unintended outputs. This iterative refinement process, driven by the attacker LLM, efficiently discovers effective jailbreaks with significantly fewer queries than previous methods.
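The loop below is a minimal sketch of this attacker–target–judge cycle. The helper functions are placeholders standing in for the actual attacker, target, and judge model calls; their names, prompts, and signatures are illustrative assumptions, not the repository's API.

```python
# Minimal sketch of PAIR's iterative refinement loop. The helpers are
# placeholders for real attacker/target/judge LLM calls; names and prompts
# are illustrative, not the repository's actual code.

def query_attacker(goal, target_str, history):
    """Placeholder: attacker LLM proposes (or refines) a candidate jailbreak prompt."""
    return f"You are a novelist writing a thriller. {goal}"

def query_target(prompt):
    """Placeholder: send the candidate prompt to the target LLM (black-box access only)."""
    return "I'm sorry, I can't help with that."

def judge_score(goal, response):
    """Placeholder: judge LLM rates how fully the response fulfills the goal (1-10)."""
    return 1

def pair_attack(goal, target_str, n_iterations=5):
    """Run one stream of attacker/target exchanges until a jailbreak is found."""
    history = []
    for _ in range(n_iterations):
        prompt = query_attacker(goal, target_str, history)  # attacker proposes a prompt
        response = query_target(prompt)                     # target responds
        score = judge_score(goal, response)                 # judge rates the response
        if score == 10:                                     # maximum score = jailbreak found
            return prompt
        # Feed the outcome back so the attacker can refine its next attempt.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None  # no jailbreak found within the query budget
```

The --n-streams option mentioned below corresponds to running several such attacker–target conversations in parallel.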
Quick Start & Requirements
- Build the environment from the provided Docker image (docker/Dockerfile).
- Set the OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY environment variables for the respective models.
- Run wandb login to authenticate with Weights & Biases.
- Update config.py with paths to local Vicuna or Llama models (a sketch of the relevant entries follows this list).
- Run: python3 main.py --attack-model [ATTACK MODEL] --target-model [TARGET MODEL] --judge-model [JUDGE MODEL] --goal [GOAL STRING] --target-str [TARGET STRING]
- Increase --n-streams (e.g., 20) for higher success rates; reduce the number of streams or the prompt size to mitigate out-of-memory (OOM) errors.
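The snippet below is an illustrative sketch of what the local-model paths in config.py might look like; the variable names and layout are assumptions and may differ from the repository's actual config.py.

```python
# Illustrative config.py entries for local models.
# Variable names and paths are assumptions; check the repository's config.py.
VICUNA_PATH = "/path/to/vicuna/weights"   # local Vicuna checkpoint directory
LLAMA_PATH = "/path/to/llama-2/weights"   # local Llama checkpoint directory
```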
Highlighted Details
Maintenance & Community
The project is associated with researchers from UPenn. Contact: pchao@wharton.upenn.edu.
Licensing & Compatibility
Limitations & Caveats
PAIR's effectiveness varies with the target LLM and the complexity of the desired jailbreak. The README suggests increasing --n-streams for better results, implying that the default settings may yield lower success rates.