jailbreakbench by JailbreakBench

Robustness benchmark for jailbreaking LLMs (NeurIPS 2024)

Created 2 years ago

506 stars

Top 61.6% on SourcePulse

Project Summary

JailbreakBench provides a robust benchmark and framework for evaluating and advancing research in Large Language Model (LLM) jailbreaking and defense. It offers a curated dataset of harmful and benign behaviors, a leaderboard for tracking attack and defense performance, and tools for red-teaming LLMs, making it valuable for researchers and developers focused on LLM safety and security.

How It Works

The benchmark utilizes the JBB-Behaviors dataset, comprising 200 distinct behaviors (harmful and benign) sourced from prior work and OpenAI's usage policies. It supports evaluating LLM responses to jailbreak prompts using various methods, including API calls via LiteLLM (supporting Together AI and OpenAI) and local execution via vLLM. The framework also includes implementations of several defense mechanisms like SmoothLLM and perplexity filtering, allowing for direct comparison of attack and defense strategies.

Quick Start & Requirements

Install via pip: pip install jailbreakbench
For local execution: pip install jailbreakbench[vllm] (requires CUDA-compatible GPU with sufficient RAM).
API access requires TOGETHER_API_KEY or OPENAI_API_KEY environment variables.
Official datasets available on Hugging Face: JailbreakBench/JBB-Behaviors.
Leaderboard: jailbreakbench.github.io

Highlighted Details

Comprehensive dataset (JBB-Behaviors) with 200 curated behaviors.
Supports both API-based and local LLM execution (via vLLM).
Includes implementations of multiple defense strategies (e.g., SmoothLLM, Perplexity Filtering).
Provides tools for evaluating jailbreak success and refusal classification using LLM judges (Llama3 70B/8B).
Structured pipeline for submitting new attacks and defenses to the leaderboard.

Maintenance & Community

The project is associated with NeurIPS 2024 Datasets and Benchmarks Track. Contributions are welcomed via a contributing guide.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Local execution via vLLM requires significant GPU resources. The benchmark focuses on specific LLMs like Vicuna and Llama-2 for local testing, though API support extends to others. Submission requires adherence to specific formatting guidelines.

Health Check

Last Commit

9 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

23 stars in the last 30 days