bon-jailbreaking by jplhughes

Code for jailbreaking LLMs using a Best-of-N approach

Created 1 year ago

549 stars

Top 58.2% on SourcePulse

Project Summary

This repository provides code for "Best-of-N" (BoN) jailbreaking, a black-box attack that repeatedly samples augmented variations of a prompt until a large language model (LLM) produces a harmful response. It is intended for researchers and practitioners who evaluate LLM security and alignment.

How It Works

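The project implements a "Best-of-N" sampling strategy: for a given request, it generates up to N randomly augmented variations of the prompt (for text, augmentations such as character shuffling or random capitalization), sends each one to the target model, and stops as soon as a response is judged harmful. Each variation is a fresh attempt, so the chance that at least one succeeds grows with N. The WavAugment, Kaldi, and TTS dependencies listed below support the same loop for audio prompts.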
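The Best-of-N loop can be sketched in a few lines: augment the prompt at random, query the model, and stop at the first response a judge flags. This is a minimal illustration, not the repository's implementation. The names `augment`, `best_of_n`, `query_model`, and `is_flagged`, and the probabilities `p_swap` and `p_case`, are assumptions made for this example, and the model and judge are passed in as plain callables so that no real API is involved.

```python
import random
from typing import Callable, Optional, Tuple


def augment(prompt: str, rng: random.Random,
            p_swap: float = 0.3, p_case: float = 0.5) -> str:
    """Randomly perturb a text prompt (hypothetical augmentation set).

    Shuffles the interior characters of some words, then flips the case of
    each character at random. Length and letter content are preserved.
    """
    words = []
    for word in prompt.split(" "):
        if len(word) > 3 and rng.random() < p_swap:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    scrambled = " ".join(words)
    return "".join(
        c.upper() if rng.random() < p_case else c.lower() for c in scrambled
    )


def best_of_n(prompt: str,
              query_model: Callable[[str], str],
              is_flagged: Callable[[str], bool],
              n: int = 100,
              seed: int = 0) -> Optional[Tuple[int, str, str]]:
    """Sample up to n augmented prompts; return the first flagged attempt.

    Returns (attempt_number, augmented_prompt, response), or None if no
    response was flagged within the budget of n samples.
    """
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_flagged(response):
            return attempt, candidate, response
    return None


if __name__ == "__main__":
    # Toy stand-ins: the "model" echoes its input and the "judge" flags
    # responses that are mostly upper case. Neither reflects a real system.
    def toy_model(text: str) -> str:
        return text

    def toy_judge(text: str) -> bool:
        letters = [c for c in text if c.isalpha()]
        return sum(c.isupper() for c in letters) > 0.6 * len(letters)

    print(best_of_n("please summarise this document", toy_model, toy_judge, n=50))
```

Keeping the model and the judge as injected callables makes the sampling budget `n` the only knob in the loop, which is the quantity the Best-of-N approach varies.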

Quick Start & Requirements

Installation: Requires micromamba for environment management.
- Create environment: micromamba env create -n bon python=3.11.7
- Activate environment: micromamba activate bon
- Install dependencies: pip install -r requirements.txt
- Install package: pip install -e .
- Install WavAugment: git clone git@github.com:facebookresearch/WavAugment.git && cd WavAugment && python setup.py develop
Prerequisites: Kaldi (install script provided), OpenAI API key, Google API keys, Hugging Face API key (for Llama3/Circuit Breaking), Grayswan API key (for Cygnet), ElevenLabs API key (for ElevenLabs TTS).
Setup: Requires creating a SECRETS file with API keys.
Documentation: micromamba installation
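Before running any experiment, it can help to check that the SECRETS file contains every key the chosen services need. The sketch below assumes a simple KEY=value format, one entry per line. Both that format and the key names in the usage example are assumptions for illustration, so check the repository README for the exact names it expects. `load_secrets` and `missing_keys` are hypothetical helpers, not part of the package.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def load_secrets(path: str = "SECRETS") -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    secrets: Dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def missing_keys(secrets: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required key names that are absent or empty."""
    return [name for name in required if not secrets.get(name)]


if __name__ == "__main__":
    # Hypothetical key names; the real ones depend on the repository.
    required = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "HF_API_KEY"]
    absent = missing_keys(load_secrets("SECRETS"), required)
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All required keys are present.")
```

Only the keys for the services actually used need to be present; for example, the ElevenLabs key matters only when ElevenLabs TTS is used.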

Highlighted Details

Implements "Best-of-N" jailbreaking for evaluating LLM robustness.
Includes a dataset of human-verbalized jailbreaks from HarmBench (308 PAIR, 307 TAP, 159 direct requests).
Provides scripts to replicate experiments, including Figure 1.
Requires significant API key setup for various LLM and TTS services.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels is provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

The setup process is complex, requiring micromamba, Kaldi, and multiple API keys from various services. The project appears to be a research code release, and its stability or production-readiness is not indicated.

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days