AutoDAN  by SheltonLiu-N

Jailbreak attack research paper for aligned LLMs

Created 1 year ago
380 stars

Top 75.0% on SourcePulse

GitHubView on GitHub
Project Summary

AutoDAN addresses the limitations of existing LLM jailbreak techniques by automatically generating stealthy and semantically meaningful prompts. It targets researchers and security professionals seeking to understand and mitigate vulnerabilities in aligned large language models. The primary benefit is an automated, effective method for red-teaming LLMs.

How It Works

AutoDAN employs a hierarchical genetic algorithm to evolve jailbreak prompts. This approach iteratively refines prompts, balancing semantic coherence with the ability to bypass LLM safety mechanisms. The hierarchical structure allows for more sophisticated prompt generation compared to simpler, token-based methods, leading to improved stealthiness and cross-model transferability.

Quick Start & Requirements

  • Install: git clone https://github.com/SheltonLiu-N/AutoDAN.git
  • Environment: Python 3.9 via Conda (conda create -n AutoDAN python=3.9, conda activate AutoDAN, pip install -r requirements.txt).
  • Models: Download via python models/download_models.py.
  • Execution: python autodan_ga_eval.py (AutoDAN-GA) or python autodan_hga_eval.py (AutoDAN-HGA). GPT mutation requires an OpenAI API key.
  • Resources: Requires downloading LLM models. Setup time is minimal after environment setup.
  • Docs: Project, Code, Paper

Highlighted Details

  • Achieves superior attack strength in cross-model transferability and cross-sample universality.
  • Effectively bypasses perplexity-based defense methods.
  • Evaluated as a strong attack on Harmbench and Easyjailbreak benchmarks.
  • Official implementation of an ICLR 2024 paper.

Maintenance & Community

The project has seen recent updates, including the release of "AutoDAN-Turbo," a life-long agent for red-teaming. The paper received a USENIX Security Distinguished Paper Award.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that some codes are built upon llm-attack, but does not detail specific compatibility or potential conflicts. The primary focus is on prompt generation, not necessarily on the LLM's response analysis beyond keyword detection.

Health Check
Last Commit

7 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
7 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.