llm-adaptive-attacks  by tml-epfl

Research paper on jailbreaking safety-aligned LLMs via adaptive attacks

created 1 year ago
324 stars

Top 85.2% on sourcepulse

Project Summary

This repository provides code and artifacts for adaptive jailbreaking attacks against safety-aligned Large Language Models (LLMs). It is aimed at researchers and practitioners who want to understand and mitigate LLM vulnerabilities, and it reports high attack success rates against models such as GPT-3.5/4, Llama-2/3, Gemma, and Claude.

How It Works

The core approach is to adapt each attack to the vulnerabilities of the target model. For models that expose logprobs, random search is applied to an adversarial prompt suffix to maximize the logprob of a target token (e.g., "Sure"), often with multiple restarts. For models such as Claude that do not expose logprobs, transfer or prefilling attacks are used instead. Adaptivity is key: attack templates are tailored to specific models, and model-specific API features are exploited where available.
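The random-search loop described above can be illustrated in isolation. The following is a minimal sketch of the optimization pattern only, not the repository's implementation: `toy_target_logprob` is a hypothetical stand-in objective that queries no model, and the names `random_search`, `n_iters`, and `n_restarts` are assumptions made for this example.

```python
import random
import string

ALPHABET = string.ascii_lowercase


def toy_target_logprob(suffix: str) -> float:
    """Hypothetical stand-in for the logprob of a target token.

    No model is queried: the score is minus the number of positions
    that differ from an arbitrary hidden string.
    """
    hidden = "adaptive"
    return -float(sum(a != b for a, b in zip(suffix, hidden)))


def random_search(score_fn, length, n_iters, n_restarts, seed=0):
    """Maximize score_fn over strings of a fixed length.

    Each step changes one random position and keeps the change only
    if the score improves. Each restart begins from a fresh random
    string, and the best result over all restarts is returned.
    """
    rng = random.Random(seed)
    best_suffix, best_score = None, float("-inf")
    for _ in range(n_restarts):
        suffix = [rng.choice(ALPHABET) for _ in range(length)]
        score = score_fn("".join(suffix))
        for _ in range(n_iters):
            candidate = list(suffix)
            candidate[rng.randrange(length)] = rng.choice(ALPHABET)
            cand_score = score_fn("".join(candidate))
            if cand_score > score:  # greedy: accept improvements only
                suffix, score = candidate, cand_score
        if score > best_score:
            best_suffix, best_score = "".join(suffix), score
    return best_suffix, best_score


suffix, score = random_search(toy_target_logprob, length=8, n_iters=2000, n_restarts=3)
```

The only thing the loop needs from the objective is a scalar score per candidate, which is why the description above limits this approach to models that expose logprobs.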

Quick Start & Requirements

  • Install dependencies: pip install fschat==0.2.23 transformers openai anthropic
  • Set API keys for OpenAI and Anthropic via environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY). HuggingFace models may require HF_TOKEN.
  • The official quick start and paper are available at https://arxiv.org/abs/2404.02151.
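Before running anything, it can help to confirm that the environment variables listed above are set. The check below is a small sketch using only the standard library. The helper name `missing_keys` is hypothetical and not part of the repository, and treating HF_TOKEN as optional is an assumption based on the "may require" wording.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
OPTIONAL_KEYS = ("HF_TOKEN",)  # assumed optional: only some HuggingFace models need it


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys(REQUIRED_KEYS):
        print(f"Missing required environment variable: {name}")
    for name in missing_keys(OPTIONAL_KEYS):
        print(f"Optional environment variable not set: {name}")
```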

Highlighted Details

  • Achieves nearly 100% attack success rates on various safety-aligned LLMs.
  • Demonstrates 100% success on Claude models using transfer or prefilling attacks.
  • Code and artifacts are available for Llama-3-8B, Phi-3-Mini, Nemotron-4-340B-Instruct, and Claude 3.5 Sonnet.
  • Includes the method for finding trojan strings in poisoned models that won the SaTML'24 Trojan Detection Competition.

Maintenance & Community

The project is maintained by authors from EPFL. Updates have added support for new models and fixed inconsistencies.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Reproducing results for Llama-2-Chat requires a specific older version of FastChat (0.2.23) due to changes in system prompt handling.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 90 days
