EasyJailbreak by EasyJailbreak

Python framework for LLM adversarial jailbreak prompt generation

created 1 year ago
690 stars

Top 50.2% on sourcepulse

Project Summary

EasyJailbreak is a Python framework for researchers and developers to generate adversarial jailbreak prompts for Large Language Models (LLMs). It decomposes the jailbreaking process into modular steps, enabling systematic experimentation and the creation of custom attack strategies.

How It Works

The framework operates in a loop: it selects initial prompts (seeds), mutates them using various techniques (e.g., rephrasing, character manipulation), applies constraints, and then attacks a target LLM. The responses are evaluated to score the attack's effectiveness, feeding back into the selection process for iterative improvement. This modular design allows users to mix and match components like selectors, mutators, constraints, and evaluators to build tailored jailbreaking recipes.
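The loop described above can be sketched in a few lines of Python. This is a minimal structural illustration, not EasyJailbreak's actual API: every name here (`select`, `run_loop`, `reverse_words`, `add_suffix`, `max_len`, `stub_target`, `toy_evaluator`) is hypothetical, the target is a local stub rather than a real LLM, and the mutators and scoring rule are harmless placeholders chosen only to make the control flow visible.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical component types; these are NOT EasyJailbreak's real interfaces.
Mutator = Callable[[str], str]
Constraint = Callable[[str], bool]
Target = Callable[[str], str]
Evaluator = Callable[[str, str], float]
Scored = Tuple[str, float]


def select(pool: List[Scored], k: int) -> List[str]:
    """Selector: return the k highest-scoring prompts from the pool."""
    ranked = sorted(pool, key=lambda item: item[1], reverse=True)
    return [prompt for prompt, _ in ranked[:k]]


def run_loop(
    seeds: List[str],
    mutators: List[Mutator],
    constraints: List[Constraint],
    target: Target,
    evaluator: Evaluator,
    rounds: int = 3,
    k: int = 2,
    rng: Optional[random.Random] = None,
) -> Scored:
    """Select -> mutate -> constrain -> query target -> evaluate, repeated."""
    rng = rng or random.Random(0)
    # Score the seeds first so the selector has something to rank.
    pool: List[Scored] = [(s, evaluator(s, target(s))) for s in seeds]
    for _ in range(rounds):
        for parent in select(pool, k):
            candidate = rng.choice(mutators)(parent)
            # Drop candidates that violate any constraint.
            if not all(check(candidate) for check in constraints):
                continue
            response = target(candidate)
            # The score feeds back into the next round's selection.
            pool.append((candidate, evaluator(candidate, response)))
    return max(pool, key=lambda item: item[1])


# Placeholder components (assumptions for illustration only).
def reverse_words(prompt: str) -> str:
    return " ".join(reversed(prompt.split()))


def add_suffix(prompt: str) -> str:
    return prompt + " please"


def max_len(limit: int) -> Constraint:
    return lambda prompt: len(prompt) <= limit


def stub_target(prompt: str) -> str:
    # Stands in for a model call; it simply echoes the prompt.
    return f"echo: {prompt}"


def toy_evaluator(prompt: str, response: str) -> float:
    # Arbitrary toy score: count of one word in the response.
    return float(response.count("please"))


best_prompt, best_score = run_loop(
    seeds=["describe the weather", "summarize this text"],
    mutators=[reverse_words, add_suffix],
    constraints=[max_len(60)],
    target=stub_target,
    evaluator=toy_evaluator,
)
print(best_prompt, best_score)
```

Because each stage is passed in as a plain callable, swapping a selector, mutator, constraint, or evaluator changes the strategy without touching the loop itself, which is the kind of mix-and-match composition the summary attributes to the framework.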

Quick Start & Requirements

Highlighted Details

  • Implements 12 distinct jailbreak "recipes" (attack strategies) such as GPTFuzz, AutoDAN, and GCG.
  • Offers a comprehensive suite of modular components: Selectors, Mutators, Constraints, and Evaluators.
  • Supports various LLM integrations, including Hugging Face models (e.g., Vicuna, Llama-2) and OpenAI's API (e.g., GPT-4).
  • Provides detailed experimental results from the paper, covering 10 LLMs and 11 attack recipes.

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial use.

Limitations & Caveats

  • Requires API keys for OpenAI models.
  • Some recipes may have specific dependencies or limitations as detailed in the documentation.
  • The framework targets LLM security research and assumes a good understanding of LLM internals and attack methodologies.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 65 stars in the last 90 days
