AutoDAN by SheltonLiu-N

Jailbreak attack research paper for aligned LLMs

Created 2 years ago

420 stars

Top 69.9% on SourcePulse

Project Summary

AutoDAN addresses the limitations of existing LLM jailbreak techniques by automatically generating stealthy and semantically meaningful prompts. It targets researchers and security professionals seeking to understand and mitigate vulnerabilities in aligned large language models. The primary benefit is an automated, effective method for red-teaming LLMs.

How It Works

AutoDAN employs a hierarchical genetic algorithm to evolve jailbreak prompts. This approach iteratively refines prompts, balancing semantic coherence with the ability to bypass LLM safety mechanisms. The hierarchical structure allows for more sophisticated prompt generation compared to simpler, token-based methods, leading to improved stealthiness and cross-model transferability.

Quick Start & Requirements

Install: git clone https://github.com/SheltonLiu-N/AutoDAN.git
Environment: Python 3.9 via Conda (conda create -n AutoDAN python=3.9, conda activate AutoDAN, pip install -r requirements.txt).
Models: Download via python models/download_models.py.
Execution: python autodan_ga_eval.py (AutoDAN-GA) or python autodan_hga_eval.py (AutoDAN-HGA). GPT mutation requires an OpenAI API key.
Resources: Requires downloading LLM models. Setup time is minimal after environment setup.
Docs: Project, Code, Paper

Highlighted Details

Achieves superior attack strength in cross-model transferability and cross-sample universality.
Effectively bypasses perplexity-based defense methods.
Evaluated as a strong attack on Harmbench and Easyjailbreak benchmarks.
Official implementation of an ICLR 2024 paper.

Maintenance & Community

The project has seen recent updates, including the release of "AutoDAN-Turbo," a life-long agent for red-teaming. The paper received a USENIX Security Distinguished Paper Award.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that some codes are built upon llm-attack, but does not detail specific compatibility or potential conflicts. The primary focus is on prompt generation, not necessarily on the LLM's response analysis beyond keyword detection.

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

14 stars in the last 30 days