Dataset for LLM jailbreak research (CCS'24 paper)
Top 15.3% on sourcepulse
This repository provides a large-scale dataset of 15,140 "in-the-wild" prompts collected from various online platforms, 1,405 of which are identified jailbreak prompts. It is intended for researchers studying the security and robustness of Large Language Models (LLMs) against adversarial inputs. The dataset enables analysis of real-world prompt-engineering techniques used to bypass LLM safety guidelines.
How It Works
The project collects and categorizes prompts from diverse sources like Reddit, Discord, and public datasets. It identifies and isolates jailbreak prompts, which are designed to elicit harmful or restricted content from LLMs. The dataset is structured to facilitate research into the characteristics and effectiveness of these adversarial prompts.
Quick Start & Requirements
from datasets import load_dataset
dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'jailbreak_2023_12_25', split='train')
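As a quick sanity check after loading, you can print the split size and a sample record. This is a minimal sketch; the prompt field is the one referenced in the caveats below, and other column names may vary by configuration.

print(dataset.num_rows)
print(dataset[0]['prompt'][:200])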
Highlighted Details
15,140 in-the-wild prompts, 1,405 of them identified as jailbreak prompts, gathered from Reddit, Discord, and public prompt datasets; released alongside the ACM CCS 2024 paper and distributed via Hugging Face in dated configurations (e.g., jailbreak_2023_12_25).
Maintenance & Community
The project is associated with the ACM CCS 2024 paper ""Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". Further community interaction details are not specified in the README.
Licensing & Compatibility
The jailbreak_llms repository is licensed under the MIT license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The dataset contains examples of harmful language, and reader discretion is advised. The project is intended for research purposes only, and misuse is strictly prohibited. Preprocessing the prompt field to remove duplicates is recommended if using the dataset for model training.
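A minimal deduplication sketch, assuming the prompt text lives in the prompt field as noted above:

from datasets import load_dataset

# Load the same snapshot used in the Quick Start section.
dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'jailbreak_2023_12_25', split='train')

# Keep only the first occurrence of each unique prompt string.
seen = set()

def first_occurrence(example):
    text = example['prompt']
    if text in seen:
        return False
    seen.add(text)
    return True

deduped = dataset.filter(first_occurrence)
print(f"{len(dataset)} rows -> {len(deduped)} unique prompts")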