jailbreakbench by JailbreakBench

Robustness benchmark for jailbreaking LLMs (NeurIPS 2024)

created 1 year ago
378 stars

Top 76.3% on sourcepulse

Project Summary

JailbreakBench provides a robust benchmark and framework for evaluating and advancing research in Large Language Model (LLM) jailbreaking and defense. It offers a curated dataset of harmful and benign behaviors, a leaderboard for tracking attack and defense performance, and tools for red-teaming LLMs, making it valuable for researchers and developers focused on LLM safety and security.

How It Works

The benchmark is built on the JBB-Behaviors dataset, comprising 200 distinct behaviors (100 harmful, 100 benign) sourced from prior work and OpenAI's usage policies. LLM responses to jailbreak prompts can be evaluated either through API calls via LiteLLM (Together AI and OpenAI are supported) or through local execution via vLLM. The framework also ships implementations of several defense mechanisms, such as SmoothLLM and perplexity filtering, allowing direct comparison of attack and defense strategies.
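To illustrate the kind of defense the framework implements, here is a minimal, self-contained sketch of a SmoothLLM-style randomized-perturbation defense. This is not the library's actual API: the `model` and `is_jailbroken` callables are caller-supplied stubs, and the perturbation rate and copy count are illustrative defaults.

```python
import random


def perturb(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly replace a fraction of characters, as in SmoothLLM."""
    rng = random.Random(seed)
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * rate))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = chr(rng.randrange(32, 127))  # random printable ASCII
    return "".join(chars)


def smoothllm_defense(prompt, model, is_jailbroken, n_copies=5, rate=0.1):
    """Query the model on several perturbed copies and majority-vote.

    `model` maps a prompt to a response string; `is_jailbroken`
    classifies a response as a successful jailbreak. Returns the
    majority verdict and a response consistent with it.
    """
    responses = [model(perturb(prompt, rate, seed=i)) for i in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    jailbroken = sum(votes) > n_copies / 2
    for resp, vote in zip(responses, votes):
        if vote == jailbroken:
            return jailbroken, resp
```

The intuition: adversarial suffixes are brittle, so character-level noise tends to break the attack on most copies, and the majority vote recovers a refusal.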

Quick Start & Requirements

  • Install via pip: pip install jailbreakbench
  • For local execution: pip install jailbreakbench[vllm] (requires a CUDA-compatible GPU with sufficient memory).
  • API access requires TOGETHER_API_KEY or OPENAI_API_KEY environment variables.
  • Official datasets available on Hugging Face: JailbreakBench/JBB-Behaviors.
  • Leaderboard: jailbreakbench.github.io
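Putting the bullets above together, a typical setup looks like the following (the API key value is a placeholder; set whichever key matches your provider):

```shell
pip install jailbreakbench
# Local inference (requires a CUDA-compatible GPU):
pip install "jailbreakbench[vllm]"
# API-based inference needs one of these environment variables:
export TOGETHER_API_KEY="<your-key>"
# or: export OPENAI_API_KEY="<your-key>"
```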

Highlighted Details

  • Comprehensive dataset (JBB-Behaviors) with 200 curated behaviors.
  • Supports both API-based and local LLM execution (via vLLM).
  • Includes implementations of multiple defense strategies (e.g., SmoothLLM, Perplexity Filtering).
  • Provides tools for evaluating jailbreak success and refusal classification using LLM judges (Llama3 70B/8B).
  • Structured pipeline for submitting new attacks and defenses to the leaderboard.
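The perplexity-filtering defense listed above can be sketched in a few lines: reject prompts whose perplexity under a language model exceeds a threshold, since optimized adversarial suffixes tend to be high-perplexity gibberish. The per-token log-probability input and the threshold value are assumptions for illustration, not the library's API.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def perplexity_filter(token_logprobs, threshold=1000.0):
    """Accept (True) prompts whose perplexity is at or below the threshold."""
    return perplexity(token_logprobs) <= threshold
```

A fluent prompt with average log-probability around -1 nat per token has perplexity near e and passes, while a suffix averaging -12 nats per token has perplexity over 100,000 and is rejected.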

Maintenance & Community

The project is associated with NeurIPS 2024 Datasets and Benchmarks Track. Contributions are welcomed via a contributing guide.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Local execution via vLLM requires significant GPU resources. The benchmark focuses on specific LLMs like Vicuna and Llama-2 for local testing, though API support extends to others. Submission requires adherence to specific formatting guidelines.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 46 stars in the last 90 days
