GPTFuzz by sherdencooper

Red-teaming tool for LLMs using auto-generated jailbreak prompts

Created 2 years ago
530 stars

Top 59.8% on SourcePulse

Project Summary

This repository provides GPTFUZZER, a framework for automatically generating jailbreak prompts to red-team Large Language Models (LLMs). It is designed for researchers and security professionals focused on identifying and mitigating LLM vulnerabilities through adversarial testing.

How It Works

GPTFUZZER employs a black-box fuzzing approach: starting from human-written jailbreak templates as seeds, an LLM (the "mutate model") generates and refines adversarial prompts, which are then used to attack a target LLM. A fine-tuned RoBERTa-large model acts as a judgment model, classifying the target LLM's responses to identify successful jailbreaks; successful mutants are fed back into the seed pool, so the iterative process can discover novel attack vectors.
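
As a concrete illustration of this loop, here is a minimal sketch (the function names, the `[INSERT PROMPT HERE]` placeholder, and the selection policy are simplifications, not the repository's actual API):

```python
import random

def fuzz(seed_templates, question, mutate_fn, target_fn, judge_fn, max_iters=100):
    """Minimal black-box fuzzing loop in the spirit of GPTFUZZER.

    mutate_fn(template) -> new template (calls the mutate model)
    target_fn(prompt)   -> the target LLM's response
    judge_fn(response)  -> True if the response is judged jailbroken
    """
    pool = list(seed_templates)
    successes = []
    for _ in range(max_iters):
        # Pick a seed template (GPTFUZZER uses a smarter selection policy;
        # uniform random is a stand-in here).
        seed = random.choice(pool)
        # Ask the mutate model to rewrite the template.
        mutant = mutate_fn(seed)
        # Instantiate the template with the harmful question, attack the target.
        response = target_fn(mutant.replace("[INSERT PROMPT HERE]", question))
        # Score the response with the judgment model.
        if judge_fn(response):
            successes.append(mutant)
            pool.append(mutant)  # successful mutants seed further mutation
    return successes
```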

Quick Start & Requirements

  • Installation is detailed in install.ipynb.
  • Requires Python; dependencies are installed via the notebook.
  • Datasets of harmful questions and human-written templates are provided, along with scripts to generate responses from various LLMs (see the sketch after this list).
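
For instance, collecting target-model responses for a batch of questions might look like this (a sketch using the `openai` client; the repository's own scripts and interfaces may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_response(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Query a target LLM once; the model choice is illustrative."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

questions = ["...", "..."]  # harmful questions from the provided dataset
responses = [get_response(q) for q in questions]
```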

Highlighted Details

  • The framework was presented at USENIX Security 2024 and won awards at Geekcon 2023.
  • Includes a judgment model (fine-tuned RoBERTa-large) hosted on Hugging Face for classifying jailbroken responses (loading sketch after this list).
  • Offers flexibility for users to implement custom mutators and seed selectors (see the second sketch below).
  • Adversarial templates found during experiments are not publicly released due to ethical concerns but are available upon request for research purposes.
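
Loading and applying the judgment model with `transformers` might look like the following (the Hugging Face repo id and the label convention below are assumptions; verify both against the project README):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

JUDGE = "hubert233/GPTFuzz"  # assumed repo id; check the README for the official one

tokenizer = AutoTokenizer.from_pretrained(JUDGE)
model = AutoModelForSequenceClassification.from_pretrained(JUDGE)
model.eval()

def is_jailbroken(response: str) -> bool:
    """Classify a target-LLM response with the fine-tuned RoBERTa judge."""
    inputs = tokenizer(response, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumed label convention: class 1 = jailbroken, class 0 = refusal.
    return logits.argmax(dim=-1).item() == 1
```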
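
And a custom mutator or seed selector might be structured roughly like this (the class shapes and method names here are hypothetical, not GPTFuzz's actual interfaces):

```python
import random

class ShortenMutator:
    """Hypothetical mutator: asks the mutate model to compress a template."""

    def __init__(self, mutate_model):
        self.mutate_model = mutate_model  # any object with a generate() method

    def mutate(self, template: str) -> str:
        instruction = ("Shorten the following jailbreak template, keeping its "
                       "question placeholder intact:\n" + template)
        return self.mutate_model.generate(instruction)

class RandomSeedSelector:
    """Hypothetical seed selector: uniform random choice over the pool
    (the paper's selection is smarter; this is just a stand-in)."""

    def select(self, seed_pool: list) -> str:
        return random.choice(seed_pool)
```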

Maintenance & Community

The project's most recent updates landed in August 2024. The authors state a goal of building a general black-box fuzzing framework for LLMs and welcome community contributions and suggestions via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing terms for commercial use or integration into closed-source projects.

Limitations & Caveats

The judgment model's performance may vary across questions and languages. The authors acknowledge potential inaccuracies in their labeled datasets and welcome corrections. Fuzzing can be slow; the authors suggest batched inference or vLLM to speed it up (see the sketch below).
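
For example, batching target-model generation through vLLM might look like this (a sketch; the model choice and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Any locally runnable Hugging Face model works; this target is an example.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = ["...", "..."]  # batch of mutated jailbreak prompts
outputs = llm.generate(prompts, params)
responses = [out.outputs[0].text for out in outputs]
```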

Health Check

  • Last commit: 11 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 30 days
