Evaluation framework for LLM red teaming and defense
HarmBench is a standardized, open-source framework for evaluating automated red teaming methods and Large Language Model (LLM) attacks and defenses. It gives researchers and developers a scalable platform for rigorously assessing LLM safety and robustness against malicious use, supporting the development of more secure AI systems.
How It Works
HarmBench employs a flexible evaluation pipeline that supports two primary use cases: evaluating red teaming methods against LLMs, and evaluating LLMs against red teaming methods. The framework is designed to be modular, allowing users to integrate their own LLMs (including Hugging Face transformers, closed-source APIs, and multimodal models) and red teaming methods. It automates the process of generating test cases, generating model completions, and evaluating these completions, with options for local execution or distributed processing via SLURM.
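A minimal sketch of driving these three stages, assuming the scripts/run_pipeline.py entry point and its --methods, --models, --step, and --mode flags; the method name GCG and model name llama2_7b are illustrative, so consult the repository documentation for the exact interface and available options:

```bash
# Hedged sketch: assumes scripts/run_pipeline.py and its flags; the method, model,
# and step names shown here are illustrative and may differ in your checkout.

# Step 1: generate test cases with a red teaming method against a target model
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 1 --mode local

# Step 2: generate completions from the target model on those test cases
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 2 --mode local

# Step 3: evaluate the completions to compute attack success rates
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 3 --mode local
```

If supported by your version, passing --step all chains the stages in a single invocation.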
Quick Start & Requirements
```bash
git clone https://github.com/centerforaisafety/HarmBench.git
cd HarmBench
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
Requires the Python dependencies listed in requirements.txt plus the spaCy model en_core_web_sm (installed above). Supports SLURM for distributed execution and Ray for local parallelization.
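Under the same assumed run_pipeline.py interface, the --mode flag is what switches between SLURM submission and local execution; the mode names below are assumptions, not a confirmed list:

```bash
# Hedged sketch of execution modes; exact mode names (e.g., local vs. local_parallel)
# may differ in your checkout.

# Submit each pipeline stage as SLURM jobs across a cluster
python ./scripts/run_pipeline.py --methods GCG --models all --step all --mode slurm

# Run the same pipeline on the current machine; a Ray-backed parallel local mode
# may also be available for multi-GPU machines
python ./scripts/run_pipeline.py --methods GCG --models all --step all --mode local
```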
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework is actively under development, with plans for further tutorials and features. Some red teaming methods may require manual configuration updates for new models.