JailbreakingLLMs by patrickrchao

Research paper on LLM jailbreaking via adversarial attacks

created 1 year ago
586 stars

Top 56.2% on sourcepulse

Project Summary

This repository implements Prompt Automatic Iterative Refinement (PAIR), an algorithm for generating adversarial jailbreaks against large language models (LLMs) using only black-box access. It is aimed at researchers and security professionals studying LLM vulnerabilities and building more robust safety mechanisms. PAIR automates the discovery of prompts that bypass LLM safety guardrails, typically requiring only a small number of queries.

How It Works

PAIR takes a social-engineering-inspired approach in which an "attacker" LLM iteratively queries a "target" LLM. The attacker LLM generates and refines jailbreak prompts based on the target LLM's responses, aiming to elicit harmful or unintended outputs. Because the refinement is driven entirely by the attacker LLM, this process discovers effective jailbreaks with significantly fewer queries than previous methods.
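
The sketch below illustrates this refinement loop in Python. It is a minimal reading of the algorithm rather than the repository's implementation: query_attacker, query_target, and judge_score are hypothetical stand-ins for API calls to the three models, and the actual code in main.py additionally runs several parallel streams (--n-streams) with full system prompts for the attacker and judge.

    # Minimal sketch of the PAIR loop (illustrative; not the repository's code).

    def query_attacker(history):
        """Hypothetical attacker-LLM call: given the conversation so far,
        return a new candidate jailbreak prompt."""
        raise NotImplementedError

    def query_target(prompt):
        """Hypothetical black-box call to the target LLM."""
        raise NotImplementedError

    def judge_score(goal, prompt, response):
        """Hypothetical judge-LLM call: rate jailbreak success from 1 to 10."""
        raise NotImplementedError

    def pair(goal, target_str, max_iters=20):
        """Refine a jailbreak prompt until the judge rates it fully successful."""
        history = [{"role": "system",
                    "content": f"Goal: {goal}. Desired opening: {target_str}"}]
        for _ in range(max_iters):
            prompt = query_attacker(history)    # attacker proposes or refines a prompt
            response = query_target(prompt)     # single black-box query to the target
            score = judge_score(goal, prompt, response)
            if score == 10:                     # judge deems the jailbreak successful
                return prompt, response
            # Feed the target's response and score back so the attacker can refine.
            history.append({"role": "user",
                            "content": f"PROMPT: {prompt}\n"
                                       f"RESPONSE: {response}\nSCORE: {score}"})
        return None, None                       # no jailbreak found within budget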

Quick Start & Requirements

  • Installation: Use the provided Dockerfile (docker/Dockerfile).
  • API Keys: Set OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY environment variables for respective models.
  • Logging: Run wandb login to authenticate with Weights & Biases.
  • Local Models: Modify config.py with paths for local Vicuna or Llama models.
  • Execution: python3 main.py --attack-model [ATTACK MODEL] --target-model [TARGET MODEL] --judge-model [JUDGE MODEL] --goal [GOAL STRING] --target-str [TARGET STRING] (a worked example follows this list)
  • Models: Supports Vicuna, Llama-2, GPT-3.5-turbo, GPT-4, Claude-instant-1, Claude-2, and Gemini-Pro.
  • Resources: Increasing --n-streams (e.g., to 20) improves success rates; reduce the number of streams or the prompt size to mitigate out-of-memory (OOM) errors.
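
As a concrete illustration of the execution command above, a hypothetical run with Vicuna as the attacker and GPT-3.5-turbo as the target might look as follows. The goal and target strings are invented for this example, and the exact model identifiers should be checked against config.py:

    export OPENAI_API_KEY="..."  # target and judge here are OpenAI models
    python3 main.py \
        --attack-model vicuna \
        --target-model gpt-3.5-turbo \
        --judge-model gpt-4 \
        --goal "Explain how to pick a basic pin tumbler lock" \
        --target-str "Sure, here is how to pick a lock" \
        --n-streams 20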

Highlighted Details

  • Generates semantic jailbreaks, often with fewer than twenty queries.
  • Achieves competitive jailbreaking success rates and transferability across various LLMs.
  • Utilizes an attacker LLM to automate prompt generation and refinement.
  • Supports both open-source and closed-source LLMs.

Maintenance & Community

The project is maintained by researchers at the University of Pennsylvania. Contact: pchao@wharton.upenn.edu.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

PAIR's effectiveness varies with the target LLM and the difficulty of the jailbreak goal. The README recommends increasing --n-streams for better results, implying that the default settings may yield lower success rates.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 46 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more.

llm-security by greshake

Research paper on indirect prompt injection attacks targeting app-integrated LLMs

created 2 years ago, updated 2 weeks ago
2k stars
Top 0.2% on sourcepulse