HarmBench by centerforaisafety

Evaluation framework for LLM red teaming and defense

created 1 year ago
693 stars

Top 50.0% on sourcepulse

View on GitHub
Project Summary

HarmBench is a standardized, open-source framework for evaluating automated red teaming methods and Large Language Model (LLM) attacks and defenses. It provides researchers and developers with a scalable platform for rigorously assessing LLM safety and robustness against malicious use, enabling the development of more secure AI systems.

How It Works

HarmBench employs a flexible evaluation pipeline that supports two primary use cases: evaluating red teaming methods against LLMs, and evaluating LLMs against red teaming methods. The framework is designed to be modular, allowing users to integrate their own LLMs (including Hugging Face transformers, closed-source APIs, and multimodal models) and red teaming methods. It automates the process of generating test cases, generating model completions, and evaluating these completions, with options for local execution or distributed processing via SLURM.
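
A rough sketch of that three-stage flow is shown below. The script names, argument order, and file paths are assumptions inferred from the description above, so check the Evaluation Pipeline docs for the exact interface before running anything.

    # Illustrative three-stage run; all script names, arguments, and paths
    # here are assumptions -- consult the Evaluation Pipeline docs.
    method_name=GCG                      # example red teaming method
    model_name=llama2_7b                 # example target model key
    behaviors_path=./data/behaviors.csv  # placeholder behaviors dataset
    save_dir=./results/$method_name/$model_name

    # 1. Generate adversarial test cases for the chosen method and target model
    ./scripts/generate_test_cases.sh $method_name $model_name $behaviors_path $save_dir

    # 2. Generate target-model completions on those test cases
    ./scripts/generate_completions.sh $model_name $behaviors_path \
        $save_dir/test_cases.json $save_dir/completions.json

    # 3. Score the completions with a HarmBench classifier
    ./scripts/evaluate_completions.sh cais/HarmBench-Llama-2-13b-cls \
        $behaviors_path $save_dir/completions.json $save_dir/results.json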

Quick Start & Requirements

  • Installation:
    git clone https://github.com/centerforaisafety/HarmBench.git
    cd HarmBench
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
    
  • Prerequisites: Python, spaCy English model (en_core_web_sm). Supports SLURM for distributed execution and Ray for local parallelization (see the pipeline sketch after this list).
  • Documentation: Evaluation Pipeline Docs
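
When SLURM or Ray is available, the same three stages can also be driven from a single entry point. The invocation below is only a hedged sketch: the driver script name, the method/model keys, and every flag are assumptions to confirm against the Evaluation Pipeline docs.

    # Hypothetical end-to-end pipeline run; script name and flags are assumptions.
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b \
        --step all --mode local_parallel   # parallelize across local GPUs with Ray
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b \
        --step all --mode slurm            # submit each stage as SLURM jobs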

Highlighted Details

  • Supports 33 evaluated LLMs and 18 red teaming methods in its initial release.
  • Includes three pre-trained classifier models for evaluating standard, contextual, and multimodal behaviors.
  • Facilitates the addition of custom models and red teaming methods through configuration files (see the config sketch after this list).
  • Offers an adversarial training method to enhance LLM robustness.

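As noted in the list above, custom models are registered through configuration files rather than code changes. The entry below is a hedged sketch of what that might look like; the config path, file name, and field names are assumptions, so mirror an existing entry in the repository instead of copying this verbatim.

    # Hypothetical registration of a custom model via a model config file.
    # Path and YAML schema are assumptions -- copy the structure of an existing entry.
    cat >> configs/model_configs/models.yaml << 'EOF'
    my_custom_model:
      model_type: open_source
      model:
        model_name_or_path: /path/to/my_custom_model
        dtype: bfloat16
      num_gpus: 1
    EOF
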
Maintenance & Community

  • Initial release in February 2024, with version 1.0 including adversarial training code and precomputed test cases.
  • Roadmap includes tutorials, additional methods, models, and behaviors.
  • Cites several influential open-source repositories in the LLM security space.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for any included components or dependencies.

Limitations & Caveats

The framework is under active development, with further tutorials and features planned. Some red teaming methods may require manual configuration updates before they can target new models.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 67 stars in the last 90 days
