llm-adaptive-attacks  by tml-epfl

Research paper on jailbreaking safety-aligned LLMs via adaptive attacks

Created 1 year ago
344 stars

Top 80.4% on SourcePulse

Project Summary

This repository provides code and artifacts for adaptive jailbreaking attacks against safety-aligned Large Language Models (LLMs). It is aimed at researchers and practitioners who want to understand and mitigate vulnerabilities in LLMs, and it reports high attack success rates against models such as GPT-3.5/4, Llama-2/3, Gemma, and Claude.

How It Works

The core approach is an adaptive attack that leverages model-specific vulnerabilities. For models that expose logprobs, random search is applied to an adversarial prompt suffix to maximize the logprob of a target token (e.g., "Sure"), often with multiple restarts. For models like Claude that don't expose logprobs, transfer or prefilling attacks are used instead. Adaptivity is key: attack templates are tailored to each model, and model-specific API features are exploited.
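The random-search-with-restarts idea can be illustrated on a toy problem. The sketch below is not the repository's implementation and makes no model calls. The names `random_search` and `toy_score` are hypothetical, and the score (how many characters match a fixed string) only stands in for the logprob objective described above.

```python
import random
import string


def random_search(score, length=12, iters=500, restarts=3, seed=0):
    """Maximize score(candidate) over fixed-length lowercase strings.

    Each step changes one random position and keeps the change only if
    the score does not decrease. The best result across restarts wins.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        # Every restart begins from a fresh random candidate.
        current = [rng.choice(alphabet) for _ in range(length)]
        current_score = score("".join(current))
        for _ in range(iters):
            pos = rng.randrange(length)
            old = current[pos]
            current[pos] = rng.choice(alphabet)
            new_score = score("".join(current))
            if new_score >= current_score:
                current_score = new_score  # accept the mutation
            else:
                current[pos] = old  # revert the mutation
        if current_score > best_score:
            best, best_score = "".join(current), current_score
    return best, best_score


def toy_score(candidate, target="hello"):
    # Toy stand-in objective: number of positions that match `target`.
    return sum(a == b for a, b in zip(candidate, target))


if __name__ == "__main__":
    print(random_search(toy_score, length=5, iters=2000, restarts=2))
```

Accepting ties (`>=`) lets the search drift across plateaus where many candidates score equally. Restarts guard against a poor initial candidate.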

Quick Start & Requirements

  • Install dependencies: pip install fschat==0.2.23 transformers openai anthropic
  • Set API keys for OpenAI and Anthropic via environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY). HuggingFace models may require HF_TOKEN.
  • The official quick-start and paper are available at https://arxiv.org/abs/2404.02151.
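Before running anything, it can help to confirm that the environment variables listed above are set. The following is a minimal sketch. The helper `missing_keys` is a hypothetical name and not part of the repository. Treating `HF_TOKEN` as optional is an assumption based on the wording "may require".

```python
import os

# Variable names are taken from the setup steps above.
REQUIRED = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
OPTIONAL = ("HF_TOKEN",)  # assumed optional: only some HuggingFace models need it


def missing_keys(env=None, names=REQUIRED):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required variables:", ", ".join(missing))
    if missing_keys(names=OPTIONAL):
        print("HF_TOKEN is not set; some HuggingFace models may be unavailable.")
```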

Highlighted Details

  • Achieves nearly 100% attack success rates on various safety-aligned LLMs.
  • Demonstrates 100% success on Claude models using transfer or prefilling attacks.
  • Code and artifacts available for Llama-3-8B, Phi-3-Mini, Nemotron-4-340B-Instruct, and Claude Sonnet 3.5.
  • Includes methods for finding trojan strings in poisoned models, the approach that won the SaTML'24 Trojan Detection Competition.

Maintenance & Community

The project comes from authors at EPFL and has received updates that add support for new models and fix inconsistencies.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Reproducing results for Llama-2-Chat requires a specific older version of FastChat (0.2.23) due to changes in system prompt handling.
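Because the Llama-2-Chat results depend on that exact FastChat release, a quick check of the installed version can catch a mismatch early. This is a minimal sketch using the standard library's `importlib.metadata`. The helper `pinned_version_ok` is a hypothetical name, and the distribution name `fschat` is taken from the install command above.

```python
from importlib import metadata


def pinned_version_ok(dist="fschat", expected="0.2.23",
                      get_version=metadata.version):
    """Return (matches, found_version); found_version is None if not installed."""
    try:
        found = get_version(dist)
    except metadata.PackageNotFoundError:
        return False, None
    return found == expected, found


if __name__ == "__main__":
    ok, found = pinned_version_ok()
    if not ok:
        print(f"Expected fschat==0.2.23, found {found!r}")
```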

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 14 stars in the last 30 days
