GPTFuzz by sherdencooper

Red-teaming tool for LLMs using auto-generated jailbreak prompts

created 1 year ago
510 stars

Top 62.0% on sourcepulse

View on GitHub
Project Summary

This repository provides GPTFUZZER, a framework for automatically generating jailbreak prompts to red team Large Language Models (LLMs). It is designed for researchers and security professionals focused on identifying and mitigating vulnerabilities in LLMs through adversarial testing.

How It Works

GPTFUZZER employs a black-box fuzzing approach, using an LLM (the "mutate model") to generate and refine adversarial prompts. These prompts are then used to attack a target LLM. A fine-tuned RoBERTa-large model acts as a judgment model to classify the target LLM's responses, identifying successful jailbreaks. This iterative process aims to discover novel attack vectors.
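At a high level, the loop looks roughly like the sketch below. This is an illustrative outline only, not the project's actual API; the `mutate`, `query_target`, and `judge` callables and the placeholder token are assumptions.

```python
import random

def fuzz(seed_templates, questions, mutate, query_target, judge, max_iterations=100):
    """Illustrative GPTFuzz-style loop (not the repository's actual API).

    mutate(template) -> new template        # the "mutate model" LLM rewrites a template
    query_target(prompt) -> response text   # black-box call to the target LLM
    judge(question, response) -> bool       # judgment model flags a jailbreak
    """
    pool = list(seed_templates)              # start from human-written jailbreak templates
    successes = []
    for _ in range(max_iterations):
        seed = random.choice(pool)           # seed selection (the paper uses a more refined strategy)
        candidate = mutate(seed)
        for question in questions:
            # "[INSERT PROMPT HERE]" is an assumed placeholder convention
            prompt = candidate.replace("[INSERT PROMPT HERE]", question)
            if judge(question, query_target(prompt)):
                successes.append((candidate, question))
                pool.append(candidate)       # successful mutants re-enter the seed pool
                break
    return successes
```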

Quick Start & Requirements

  • Installation is detailed in install.ipynb.
  • Requires Python. Specific dependencies are managed via the notebook.
  • Datasets for harmful questions and human-written templates are provided (see the loading sketch after this list). Scripts are available to generate responses from various LLMs.
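
A minimal sketch of loading those datasets for a custom run; the file paths and column name below are hypothetical, so check the repository's datasets/ directory for the actual layout.

```python
import pandas as pd

# Hypothetical paths and column name; the repository ships harmful questions
# and human-written jailbreak templates as CSV files.
questions = pd.read_csv("datasets/questions/question_list.csv")["text"].tolist()
templates = pd.read_csv("datasets/prompts/jailbreak_templates.csv")["text"].tolist()

print(f"{len(questions)} harmful questions, {len(templates)} seed templates")
```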

Highlighted Details

  • The framework was presented at USENIX Security 2024 and won awards at Geekcon 2023.
  • Includes a judgment model (RoBERTa-large) hosted on Hugging Face for classifying jailbroken responses (a loading sketch follows this list).
  • Offers flexibility for users to implement custom mutators and seed selectors.
  • Adversarial templates found during experiments are not publicly released due to ethical concerns but are available upon request for research purposes.
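
Loading the judgment model follows the standard Hugging Face sequence-classification pattern. A minimal sketch; the model ID and the meaning of label 1 are assumptions to verify against the repository's README.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

MODEL_ID = "hubert233/GPTFuzz"  # assumed Hugging Face ID of the fine-tuned RoBERTa-large judge

tokenizer = RobertaTokenizer.from_pretrained(MODEL_ID)
model = RobertaForSequenceClassification.from_pretrained(MODEL_ID).eval()

def is_jailbroken(response: str) -> bool:
    """Classify a target-LLM response; label 1 is assumed to mean 'jailbroken'."""
    inputs = tokenizer(response, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```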

Maintenance & Community

The project is actively maintained, with recent updates in August 2024. The authors express a goal to build a general black-box fuzzing framework for LLMs and welcome community contributions and suggestions via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing terms for commercial use or integration into closed-source projects.

Limitations & Caveats

The judgment model's performance may vary across different questions and languages. The authors acknowledge potential inaccuracies in their labeled datasets and are open to corrections. Fuzzing can be slow; the authors suggest batched inference or vLLM to speed up generation.
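
For reference, batched generation with vLLM looks roughly like the snippet below; the model name and sampling settings are placeholders, not the project's configuration.

```python
from vllm import LLM, SamplingParams

# Placeholder model; substitute the LLM you are fuzzing or using as the mutate model.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.8, max_tokens=512)

prompts = ["<mutated jailbreak template + question>"] * 8  # a batch of attack prompts
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```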

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more.

llm-security by greshake

Research paper on indirect prompt injection attacks targeting app-integrated LLMs

Top 0.2% on sourcepulse
2k stars
created 2 years ago, updated 2 weeks ago