Evaluation framework for LLM red teaming and defense
HarmBench is a standardized, open-source framework for evaluating automated red teaming methods and Large Language Model (LLM) attacks and defenses. It gives researchers and developers a scalable platform for rigorously assessing LLM safety and robustness against malicious use, supporting the development of more secure AI systems.
How It Works
HarmBench employs a flexible evaluation pipeline that supports two primary use cases: evaluating red teaming methods against LLMs, and evaluating LLMs against red teaming methods. The framework is designed to be modular, allowing users to integrate their own LLMs (including Hugging Face transformers, closed-source APIs, and multimodal models) and red teaming methods. It automates the process of generating test cases, generating model completions, and evaluating these completions, with options for local execution or distributed processing via SLURM.
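A minimal sketch of driving these three stages, assuming the scripts/run_pipeline.py entry point and its --methods, --models, --step, and --mode flags; the method name GCG and model name llama2_7b are illustrative, so consult the repository documentation for the exact interface and available options:

```bash
# Hedged sketch: assumes scripts/run_pipeline.py and its flags; the method, model,
# and step names shown here are illustrative and may differ in your checkout.

# Step 1: generate test cases with a red teaming method against a target model
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 1 --mode local

# Step 2: generate completions from the target model on those test cases
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 2 --mode local

# Step 3: evaluate the completions to compute attack success rates
python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 3 --mode local
```

If supported by your version, passing --step all chains the stages in a single invocation.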
Quick Start & Requirements
```bash
git clone https://github.com/centerforaisafety/HarmBench.git
cd HarmBench
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
Requires the Python dependencies listed in requirements.txt plus the spaCy model en_core_web_sm (installed above). Supports SLURM for distributed execution and Ray for local parallelization.
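Under the same assumed run_pipeline.py interface, the --mode flag is what switches between SLURM submission and local execution; the mode names below are assumptions, not a confirmed list:

```bash
# Hedged sketch of execution modes; exact mode names (e.g., local vs. local_parallel)
# may differ in your checkout.

# Submit each pipeline stage as SLURM jobs across a cluster
python ./scripts/run_pipeline.py --methods GCG --models all --step all --mode slurm

# Run the same pipeline on the current machine; a Ray-backed parallel local mode
# may also be available for multi-GPU machines
python ./scripts/run_pipeline.py --methods GCG --models all --step all --mode local
```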
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework is actively under development, with plans for further tutorials and features. Some red teaming methods may require manual configuration updates for new models.