Research paper on jailbreaking safety-aligned LLMs via adaptive attacks
This repository provides code and artifacts for adaptive jailbreaking attacks against safety-aligned Large Language Models (LLMs). It targets researchers and practitioners seeking to understand and mitigate vulnerabilities in LLMs, demonstrating high success rates against models like GPT-3.5/4, Llama-2/3, Gemma, and Claude.
How It Works
The core approach involves adaptive attacks that exploit model-specific vulnerabilities. For models exposing logprobs, a random search over an adversarial prompt suffix maximizes the logprob of a target token (e.g., "Sure"), often with multiple restarts. For models like Claude that do not expose logprobs, transfer or prefilling attacks are used instead. Adaptivity is key: attack templates are tailored to each model, and available API features (such as prefilling) are exploited.
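To make the random-search step concrete, the sketch below illustrates the idea under simplifying assumptions: target_logprob is a hypothetical stand-in for a model query that returns the logprob of the target token (e.g., "Sure") given prompt plus suffix, and the search parameters are illustrative rather than the repository's actual settings.

import random
import string

# Hypothetical stand-in for a model query returning the logprob of the
# target token (e.g., "Sure") given prompt + suffix.
def target_logprob(prompt: str, suffix: str) -> float:
    return -float(abs(hash(prompt + suffix)) % 1000) / 100.0  # placeholder score

def random_search(prompt: str, suffix_len: int = 25, n_iters: int = 500, n_restarts: int = 3) -> str:
    charset = string.ascii_letters + string.digits + string.punctuation
    best_suffix, best_score = "", float("-inf")
    for _ in range(n_restarts):
        # Each restart begins from a fresh random suffix.
        suffix = "".join(random.choice(charset) for _ in range(suffix_len))
        score = target_logprob(prompt, suffix)
        for _ in range(n_iters):
            # Mutate one random position; keep the change only if the
            # target-token logprob improves.
            pos = random.randrange(suffix_len)
            candidate = suffix[:pos] + random.choice(charset) + suffix[pos + 1:]
            cand_score = target_logprob(prompt, candidate)
            if cand_score > score:
                suffix, score = candidate, cand_score
        if score > best_score:
            best_suffix, best_score = suffix, score
    return best_suffix

adv_suffix = random_search("<harmful request goes here>")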
Quick Start & Requirements
Install the dependencies: pip install fschat==0.2.23 transformers openai anthropic
Closed-source models require the corresponding API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY). HuggingFace models may require HF_TOKEN.
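As a quick sanity check before running the attack scripts, a snippet along these lines can verify that the environment variables are set; the key names follow the requirements above, while the check itself is only an illustration.

import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # needed for the closed-source APIs
optional = ["HF_TOKEN"]  # needed only for gated HuggingFace models

missing = [k for k in required if not os.environ.get(k)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
for k in optional:
    if not os.environ.get(k):
        print(f"Note: {k} is not set; gated HuggingFace models will be unavailable.")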
Highlighted Details
Maintenance & Community
The project is maintained by its primary authors at EPFL and has seen recent updates adding support for new models and fixing inconsistencies.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Reproducing results for Llama-2-Chat requires the pinned older FastChat version (fschat==0.2.23), because later releases changed how the default system prompt is handled.
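A minimal sketch of why the pin matters, assuming fschat==0.2.23 exposes get_conv_template as it did in that era: the Llama-2 conversation template in that release still inserts the default Llama-2 system prompt, which later FastChat versions changed.

from fastchat.conversation import get_conv_template

# Build a Llama-2 chat prompt the way the attack scripts expect it to look
# under fschat==0.2.23 (default system prompt included).
conv = get_conv_template("llama-2")
conv.append_message(conv.roles[0], "Tell me a story about a robot.")
conv.append_message(conv.roles[1], None)  # leave the assistant turn open
prompt = conv.get_prompt()
print(prompt)  # includes the [INST] ... [/INST] wrapping and the system prompt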