HarmBench by centerforaisafety

Evaluation framework for LLM red teaming and defense

Created 1 year ago
726 stars

Top 47.5% on SourcePulse

View on GitHub
Project Summary

HarmBench is a standardized, open-source framework for evaluating automated red teaming methods and Large Language Model (LLM) attacks and defenses. It provides a scalable platform for researchers and developers to rigorously assess LLM safety and robustness against malicious use cases, enabling the development of more secure AI systems.

How It Works

HarmBench employs a flexible evaluation pipeline that supports two primary use cases: evaluating red teaming methods (attacks) against a set of LLMs, and evaluating LLMs (defenses) against a set of red teaming methods. The framework is modular, letting users integrate their own LLMs (including Hugging Face transformers models, closed-source APIs, and multimodal models) and their own red teaming methods. It automates generating test cases, generating model completions, and evaluating those completions, with options for local execution or distributed processing via SLURM.
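
As a minimal sketch of what this looks like in practice, the commands below walk through the three pipeline steps for one method/model pair. The run_pipeline.py entry point and the --methods, --models, --step, and --mode flags are assumptions based on the description above, not a confirmed interface; check the evaluation pipeline docs before running.

    # Illustrative only: script path and flag values are assumptions, not the confirmed CLI.
    # Step 1: generate test cases with a red teaming method (e.g., GCG) for the target model
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 1 --mode local
    # Step 2: generate completions from the target model on those test cases
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 2 --mode local
    # Step 3: score the completions with the evaluation classifier
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step 3 --mode local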

Quick Start & Requirements

  • Installation:
    git clone https://github.com/centerforaisafety/HarmBench.git
    cd HarmBench
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
    
  • Prerequisites: Python and the spaCy English model (en_core_web_sm). Supports SLURM for distributed execution and Ray for local parallelization; see the sketch after this list.
  • Documentation: Evaluation Pipeline Docs
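
The sketch below shows how the execution back ends mentioned above might be selected, assuming a single --mode flag with local, local_parallel (Ray), and slurm values; the flag and its values are inferred from the prerequisites above rather than confirmed, so verify them against the Evaluation Pipeline Docs.

    # Assumed --mode values; verify against the Evaluation Pipeline Docs.
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step all --mode local           # single machine, sequential
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step all --mode local_parallel  # parallelize across local GPUs with Ray
    python ./scripts/run_pipeline.py --methods GCG --models llama2_7b --step all --mode slurm           # dispatch stages as SLURM jobs on a cluster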

Highlighted Details

  • Supports 33 evaluated LLMs and 18 red teaming methods in its initial release.
  • Includes three pre-trained classifier models for evaluating standard, contextual, and multimodal behaviors (see the sketch after this list).
  • Facilitates the addition of custom models and red teaming methods through configuration files.
  • Offers an adversarial training method to enhance LLM robustness.
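
As a hedged illustration of how a released classifier might be plugged into the evaluation step, the command below scores a file of completions against a behaviors file. The script name, flag names, and the cais/HarmBench-Llama-2-13b-cls model ID are assumptions to verify against the repository; the angle-bracket paths are placeholders.

    # Illustrative only: script and flag names are assumptions, not the confirmed interface.
    python ./evaluate_completions.py \
        --cls_path cais/HarmBench-Llama-2-13b-cls \
        --behaviors_path <path_to_behaviors_csv> \
        --completions_path <path_to_completions_json> \
        --save_path <path_to_results_json>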

Maintenance & Community

  • Initial release in February 2024, with version 1.0 including adversarial training code and precomputed test cases.
  • Roadmap includes tutorials, additional methods, models, and behaviors.
  • Cites several influential open-source repositories in the LLM security space.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for any included components or dependencies.

Limitations & Caveats

Planned tutorials and additional features remain on the roadmap, but recent repository activity is low (see Health Check below). Some red teaming methods may require manual configuration updates to support new models.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 23 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

Top 0.1% on SourcePulse · 3k stars · LLM evaluation framework
Created 2 years ago · Updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 14 more.

verifiers by willccbb

Top 3.1% on SourcePulse · 3k stars · RL for LLMs in verifiable environments
Created 7 months ago · Updated 23 hours ago