Robustness benchmark for jailbreaking LLMs (NeurIPS 2024)
Top 76.3% on sourcepulse
JailbreakBench provides a robust benchmark and framework for evaluating and advancing research in Large Language Model (LLM) jailbreaking and defense. It offers a curated dataset of harmful and benign behaviors, a leaderboard for tracking attack and defense performance, and tools for red-teaming LLMs, making it valuable for researchers and developers focused on LLM safety and security.
How It Works
The benchmark utilizes the JBB-Behaviors dataset, comprising 200 distinct behaviors (harmful and benign) sourced from prior work and OpenAI's usage policies. It supports evaluating LLM responses to jailbreak prompts using various methods, including API calls via LiteLLM (supporting Together AI and OpenAI) and local execution via vLLM. The framework also includes implementations of several defense mechanisms like SmoothLLM and perplexity filtering, allowing for direct comparison of attack and defense strategies.
Quick Start & Requirements
pip install jailbreakbench
pip install jailbreakbench[vllm]
(requires CUDA-compatible GPU with sufficient RAM).TOGETHER_API_KEY
or OPENAI_API_KEY
environment variables.Highlighted Details
Maintenance & Community
The project is associated with NeurIPS 2024 Datasets and Benchmarks Track. Contributions are welcomed via a contributing guide.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Local execution via vLLM requires significant GPU resources. The benchmark focuses on specific LLMs like Vicuna and Llama-2 for local testing, though API support extends to others. Submission requires adherence to specific formatting guidelines.
4 months ago
1 week