Research paper on jailbreaking safety-aligned LLMs via adaptive attacks
This repository provides code and artifacts for adaptive jailbreaking attacks against safety-aligned Large Language Models (LLMs). It targets researchers and practitioners seeking to understand and mitigate vulnerabilities in LLMs, demonstrating high success rates against models like GPT-3.5/4, Llama-2/3, Gemma, and Claude.
How It Works
The core approach involves adaptive attacks that exploit model-specific vulnerabilities. For models exposing logprobs, a random search over an adversarial prompt suffix maximizes the logprob of a target token (e.g., "Sure"), often with multiple restarts. For models like Claude that do not expose logprobs, transfer or prefilling attacks are used instead. Adaptivity is key: attack templates are tailored to each model, and available API features (such as prefilling) are exploited.
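To make the random-search step concrete, the sketch below illustrates the idea under simplifying assumptions: target_logprob is a hypothetical stand-in for a model query that returns the logprob of the target token (e.g., "Sure") given prompt plus suffix, and the search parameters are illustrative rather than the repository's actual settings.

import random
import string

# Hypothetical stand-in for a model query returning the logprob of the
# target token (e.g., "Sure") given prompt + suffix.
def target_logprob(prompt: str, suffix: str) -> float:
    return -float(abs(hash(prompt + suffix)) % 1000) / 100.0  # placeholder score

def random_search(prompt: str, suffix_len: int = 25, n_iters: int = 500, n_restarts: int = 3) -> str:
    charset = string.ascii_letters + string.digits + string.punctuation
    best_suffix, best_score = "", float("-inf")
    for _ in range(n_restarts):
        # Each restart begins from a fresh random suffix.
        suffix = "".join(random.choice(charset) for _ in range(suffix_len))
        score = target_logprob(prompt, suffix)
        for _ in range(n_iters):
            # Mutate one random position; keep the change only if the
            # target-token logprob improves.
            pos = random.randrange(suffix_len)
            candidate = suffix[:pos] + random.choice(charset) + suffix[pos + 1:]
            cand_score = target_logprob(prompt, candidate)
            if cand_score > score:
                suffix, score = candidate, cand_score
        if score > best_score:
            best_suffix, best_score = suffix, score
    return best_suffix

adv_suffix = random_search("<harmful request goes here>")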
Quick Start & Requirements
Install the dependencies: pip install fschat==0.2.23 transformers openai anthropic
Closed-source models require the corresponding API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY). HuggingFace models may require HF_TOKEN.
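As a quick sanity check before running the attack scripts, a snippet along these lines can verify that the environment variables are set; the key names follow the requirements above, while the check itself is only an illustration.

import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # needed for the closed-source APIs
optional = ["HF_TOKEN"]  # needed only for gated HuggingFace models

missing = [k for k in required if not os.environ.get(k)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
for k in optional:
    if not os.environ.get(k):
        print(f"Note: {k} is not set; gated HuggingFace models will be unavailable.")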
Highlighted Details
Maintenance & Community
The project is maintained by its primary authors at EPFL and has seen recent updates adding support for new models and fixing inconsistencies.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Reproducing results for Llama-2-Chat requires the pinned older FastChat version (fschat==0.2.23), because later releases changed how the default system prompt is handled.
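A minimal sketch of why the pin matters, assuming fschat==0.2.23 exposes get_conv_template as it did in that era: the Llama-2 conversation template in that release still inserts the default Llama-2 system prompt, which later FastChat versions changed.

from fastchat.conversation import get_conv_template

# Build a Llama-2 chat prompt the way the attack scripts expect it to look
# under fschat==0.2.23 (default system prompt included).
conv = get_conv_template("llama-2")
conv.append_message(conv.roles[0], "Tell me a story about a robot.")
conv.append_message(conv.roles[1], None)  # leave the assistant turn open
prompt = conv.get_prompt()
print(prompt)  # includes the [INST] ... [/INST] wrapping and the system prompt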