Red-teaming tool for LLMs using auto-generated jailbreak prompts
This repository provides GPTFUZZER, a framework for automatically generating jailbreak prompts to red team Large Language Models (LLMs). It is designed for researchers and security professionals focused on identifying and mitigating vulnerabilities in LLMs through adversarial testing.
How It Works
GPTFUZZER takes a black-box fuzzing approach. Starting from human-written jailbreak templates as seeds, it uses an LLM (the "mutate model") to mutate and refine these templates into new adversarial prompts, which are then used to attack a target LLM. A fine-tuned RoBERTa-large model serves as the judgment model, classifying the target LLM's responses to determine whether a jailbreak succeeded. Successful mutants are fed back into the seed pool, and the iterative process aims to discover novel attack vectors.
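A minimal sketch of this loop, under simplifying assumptions, is shown below. The helper names (mutate, query_target, judge), the random seed selection, and the template placeholder string are illustrative stand-ins, not the repository's actual API; in particular, GPTFUZZER uses a more sophisticated seed-selection strategy than random choice.

```python
# Illustrative sketch of a GPTFUZZER-style fuzzing loop (not the project's API).
import random

def fuzz(seed_templates, question, mutate, query_target, judge,
         max_iterations=100):
    """Iteratively mutate jailbreak templates and keep those that succeed."""
    pool = list(seed_templates)          # human-written jailbreak templates as seeds
    successful = []

    for _ in range(max_iterations):
        seed = random.choice(pool)       # simplified; the real tool uses smarter seed selection
        mutant = mutate(seed)            # mutate model (an LLM) rewrites the template
        # "[QUESTION]" is a placeholder slot assumed for this sketch
        prompt = mutant.replace("[QUESTION]", question)
        response = query_target(prompt)  # black-box call to the target LLM
        if judge(response):              # judgment model flags a successful jailbreak
            successful.append(mutant)
            pool.append(mutant)          # successful mutants re-enter the seed pool
    return successful
```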
Quick Start & Requirements
Installation and environment setup instructions are provided in the install.ipynb notebook included in the repository.
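As a quick sanity check after setup, the snippet below loads a fine-tuned RoBERTa judgment model via Hugging Face transformers and classifies a single response. The model ID "hubert233/GPTFuzz" and the label convention are assumptions; consult install.ipynb and the repository README for the authoritative checkpoint.

```python
# Hypothetical smoke test: score one response with the RoBERTa judgment model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "hubert233/GPTFuzz"  # assumed checkpoint name; verify against the repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

response = "I'm sorry, but I can't help with that."
inputs = tokenizer(response, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted label:", logits.argmax(dim=-1).item())  # assumed: 1 = jailbroken
```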
Maintenance & Community
The project is actively maintained, with recent updates in August 2024. The authors express a goal to build a general black-box fuzzing framework for LLMs and welcome community contributions and suggestions via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license. Users should verify licensing terms for commercial use or integration into closed-source projects.
Limitations & Caveats
The judgment model's performance may vary across different questions and languages. The authors acknowledge potential inaccuracies in their labeled datasets and are open to corrections. Fuzzing can be slow, with suggestions to use batched inference or vLLM for performance improvements.
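For reference, batched generation with vLLM roughly looks like the sketch below; the target model name and sampling settings are placeholders, and the repository's own scripts may wire this up differently.

```python
# Illustrative use of vLLM to batch target-model queries during fuzzing.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # example target model (placeholder)
sampling = SamplingParams(temperature=0.9, max_tokens=256)

prompts = [
    "First mutated jailbreak prompt ...",
    "Second mutated jailbreak prompt ...",
]
# vLLM batches these requests internally, which is much faster than
# querying the target model one prompt at a time.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text[:80])
```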