AutoDAN  by SheltonLiu-N

Jailbreak attack research paper for aligned LLMs

created 1 year ago
367 stars

Top 78.0% on sourcepulse

GitHubView on GitHub
Project Summary

AutoDAN addresses the limitations of existing LLM jailbreak techniques by automatically generating stealthy and semantically meaningful prompts. It targets researchers and security professionals seeking to understand and mitigate vulnerabilities in aligned large language models. The primary benefit is an automated, effective method for red-teaming LLMs.

How It Works

AutoDAN employs a hierarchical genetic algorithm to evolve jailbreak prompts. This approach iteratively refines prompts, balancing semantic coherence with the ability to bypass LLM safety mechanisms. The hierarchical structure allows for more sophisticated prompt generation compared to simpler, token-based methods, leading to improved stealthiness and cross-model transferability.

Quick Start & Requirements

  • Install: git clone https://github.com/SheltonLiu-N/AutoDAN.git
  • Environment: Python 3.9 via Conda (conda create -n AutoDAN python=3.9, conda activate AutoDAN, pip install -r requirements.txt).
  • Models: Download via python models/download_models.py.
  • Execution: python autodan_ga_eval.py (AutoDAN-GA) or python autodan_hga_eval.py (AutoDAN-HGA). GPT mutation requires an OpenAI API key.
  • Resources: Requires downloading LLM models. Setup time is minimal after environment setup.
  • Docs: Project, Code, Paper

Highlighted Details

  • Achieves superior attack strength in cross-model transferability and cross-sample universality.
  • Effectively bypasses perplexity-based defense methods.
  • Evaluated as a strong attack on Harmbench and Easyjailbreak benchmarks.
  • Official implementation of an ICLR 2024 paper.

Maintenance & Community

The project has seen recent updates, including the release of "AutoDAN-Turbo," a life-long agent for red-teaming. The paper received a USENIX Security Distinguished Paper Award.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that some codes are built upon llm-attack, but does not detail specific compatibility or potential conflicts. The primary focus is on prompt generation, not necessarily on the LLM's response analysis beyond keyword detection.

Health Check
Last commit

6 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
1
Star History
46 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), Carol Willing Carol Willing(Core Contributor to CPython, Jupyter), and
2 more.

llm-security by greshake

0.2%
2k
Research paper on indirect prompt injection attacks targeting app-integrated LLMs
created 2 years ago
updated 2 weeks ago
Feedback? Help us improve.