AutoDAN-Turbo by SaFo-Lab

Official implementation of a research paper on jailbreaking LLMs via strategy self-exploration

Created 2 years ago

347 stars

Top 80.4% on SourcePulse

Project Summary

This repository provides the official implementation for AutoDAN-Turbo, a novel black-box LLM jailbreaking method. It automatically discovers and utilizes jailbreak strategies without human intervention, significantly outperforming baseline methods and achieving high success rates on models like GPT-4. The framework is designed for researchers and security professionals focused on LLM safety and red-teaming.

How It Works

AutoDAN-Turbo employs a lifelong learning agent that explores and generates jailbreak strategies autonomously. It leverages a self-exploration mechanism to discover new attack vectors and can integrate existing human-designed strategies for enhanced performance. This approach allows for continuous improvement and adaptation to evolving LLM defenses.

Quick Start & Requirements

Install: Clone the repository and set up a Python 3.12 environment using Conda. Install dependencies via pip install -r requirements.txt.
Prerequisites: Requires API keys for OpenAI (for embedding models) and DeepSeek (for foundation models), or Azure API credentials. A Hugging Face token is also needed.
Setup: Download LLM chat templates from llm/chat_templates.
Resources: Training involves API calls to LLM providers and can be resource-intensive.
Docs: Official Website
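
Because a run depends on several credentials, it can help to confirm they are present before starting. The following is a minimal sketch, not part of the repository: the environment variable names (OPENAI_API_KEY, DEEPSEEK_API_KEY, HF_TOKEN) and the helper missing_credentials are assumptions made for illustration, and the project may expect keys to be passed differently (for example, as command-line arguments).

```python
import os

# Hypothetical variable names chosen for illustration; check the
# repository's README for how credentials are actually supplied.
REQUIRED_VARS = ["OPENAI_API_KEY", "DEEPSEEK_API_KEY", "HF_TOKEN"]


def missing_credentials(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials(os.environ)
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All expected credentials are set.")
```

Passing the environment in as a plain mapping keeps the check easy to test without touching real secrets.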

Highlighted Details

Achieves an 88.5% attack success rate (ASR) on GPT-4-1106-turbo.
The AutoDAN-Turbo-R version achieves an ASR above 0.99 on the Llama3 series.
Can integrate human-designed strategies, reaching up to a 93.4% ASR on GPT-4-1106-turbo.
Accepted as a spotlight paper at ICLR 2025.
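
The attack success rate figures above are ratios of successful attempts to total attempts. As a small worked example of that arithmetic, the sketch below computes the rate from a list of per-attempt outcomes. The function name attack_success_rate and the sample data are illustrative assumptions; how the paper judges an individual attempt as successful is not described in this summary.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts marked successful.

    `outcomes` is a list of booleans, one per attempt.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    # True counts as 1 and False as 0 when summed.
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for illustration only, not results from the paper.
sample = [True, True, True, False]
print(f"{attack_success_rate(sample):.1%}")  # prints 75.0%
```

The same value can be reported as a percentage (75.0%) or as a fraction (0.75), which is why the figures above mix both forms.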

Maintenance & Community

The project is associated with SaFoLab-WISC, and the accompanying paper has been accepted to ICLR 2025. Recent updates include the release of AutoDAN-Turbo-R, a version for reasoning models.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The primary implementation relies heavily on external LLM APIs (OpenAI, DeepSeek, Azure), which require API keys and incur usage costs. Review the licensing and terms of use for these APIs before running it.

Health Check

Last Commit: 4 months ago
Responsiveness: Inactive

Star History

4 stars in the last 30 days