Jailbreak attack research paper for aligned LLMs
Top 78.0% on sourcepulse
AutoDAN addresses the limitations of existing LLM jailbreak techniques by automatically generating stealthy and semantically meaningful prompts. It targets researchers and security professionals seeking to understand and mitigate vulnerabilities in aligned large language models. The primary benefit is an automated, effective method for red-teaming LLMs.
How It Works
AutoDAN employs a hierarchical genetic algorithm to evolve jailbreak prompts. This approach iteratively refines prompts, balancing semantic coherence with the ability to bypass LLM safety mechanisms. The hierarchical structure allows for more sophisticated prompt generation compared to simpler, token-based methods, leading to improved stealthiness and cross-model transferability.
Quick Start & Requirements
git clone https://github.com/SheltonLiu-N/AutoDAN.git
conda create -n AutoDAN python=3.9
, conda activate AutoDAN
, pip install -r requirements.txt
).python models/download_models.py
.python autodan_ga_eval.py
(AutoDAN-GA) or python autodan_hga_eval.py
(AutoDAN-HGA). GPT mutation requires an OpenAI API key.Highlighted Details
Maintenance & Community
The project has seen recent updates, including the release of "AutoDAN-Turbo," a life-long agent for red-teaming. The paper received a USENIX Security Distinguished Paper Award.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README mentions that some codes are built upon llm-attack
, but does not detail specific compatibility or potential conflicts. The primary focus is on prompt generation, not necessarily on the LLM's response analysis beyond keyword detection.
6 months ago
1 day