Research paper for jailbreaking LLMs via strategy self-exploration
This repository provides the official implementation for AutoDAN-Turbo, a novel black-box LLM jailbreaking method. It automatically discovers and utilizes jailbreak strategies without human intervention, significantly outperforming baseline methods and achieving high success rates on models like GPT-4. The framework is designed for researchers and security professionals focused on LLM safety and red-teaming.
How It Works
AutoDAN-Turbo employs a lifelong learning agent that explores and generates jailbreak strategies autonomously. It leverages a self-exploration mechanism to discover new attack vectors and can integrate existing human-designed strategies for enhanced performance. This approach allows for continuous improvement and adaptation to evolving LLM defenses.
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
.llm/chat_templates
Highlighted Details
Maintenance & Community
The project is associated with SaFoLab-WISC, and the paper has been accepted to ICLR 2025. Updates include the release of AutoDAN-Turbo-R for reasoning models.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The primary implementation relies heavily on external LLM APIs (OpenAI, DeepSeek, Azure), so it requires API keys and incurs usage costs. Review each provider's licensing and terms of use before running experiments.
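Because the pipeline depends on several hosted APIs, it can help to confirm that credentials are present before starting a long run. The following is a minimal sketch, not part of the repository: the environment variable names are assumptions chosen for illustration, and the repository may instead take keys as command-line arguments or under different names.

```python
import os

# Hypothetical variable names (assumptions, not taken from the repository);
# check the project's README for how it actually expects keys to be supplied.
REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "AZURE_OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All expected API keys are set.")
```

Running the check first avoids discovering a missing credential partway through a billed run.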