bon-jailbreaking  by jplhughes

Code for jailbreaking LLMs using a Best-of-N approach

created 9 months ago
528 stars

Top 60.7% on sourcepulse

Project Summary

This repository provides code for Best-of-N (BoN) jailbreaking, a black-box attack that probes the robustness of large language models (LLMs) by repeatedly sampling augmented variations of a prompt until one elicits a harmful response. It is intended for researchers and practitioners working on LLM security and alignment.

How It Works

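The project implements a Best-of-N sampling strategy: for a given request, it generates many randomly augmented versions of the prompt, queries the target model with each one, and stops when a response is classified as harmful or the budget of N samples is exhausted. Because the attack succeeds if any single sample succeeds, its success rate rises with N, which makes it a measure of how well a model's safeguards hold up under repeated sampling. The WavAugment, Kaldi, and TTS dependencies listed below suggest that audio inputs are covered as well as text.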
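The sampling loop can be sketched roughly as follows. This is a minimal illustration and not the repository's implementation: `best_of_n`, `random_capitalization`, and the three callables (`augment`, `query_model`, `meets_criterion`) are hypothetical names, and the model and classifier are left as caller-supplied stubs.

```python
import random
from typing import Callable, Optional, Tuple


def best_of_n(
    prompt: str,
    augment: Callable[[str, random.Random], str],
    query_model: Callable[[str], str],
    meets_criterion: Callable[[str], bool],
    n: int,
    seed: int = 0,
) -> Optional[Tuple[int, str, str]]:
    """Try up to n augmented prompts (hypothetical helper).

    Returns (attempt number, augmented prompt, response) for the first
    response that meets the criterion, or None if the budget runs out.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)      # fresh random variation each time
        response = query_model(candidate)     # stub: caller supplies the model
        if meets_criterion(response):         # stub: caller supplies the classifier
            return attempt, candidate, response
    return None


def random_capitalization(text: str, rng: random.Random, p: float = 0.5) -> str:
    """Illustrative augmentation: upper-case each character with probability p."""
    return "".join(c.upper() if rng.random() < p else c.lower() for c in text)


if __name__ == "__main__":
    # Toy stand-ins: the "model" echoes its input and the "classifier"
    # checks whether the echoed text starts with two capital letters.
    result = best_of_n(
        "hello world",
        augment=random_capitalization,
        query_model=lambda s: s,
        meets_criterion=lambda r: r[:2].isupper(),
        n=50,
    )
    print(result)
```

Stopping at the first success keeps the query count low. To study how success rate scales with N, one would instead record the attempt number at which each prompt first succeeds.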

Quick Start & Requirements

  • Installation: Requires micromamba for environment management.
    • Create environment: micromamba env create -n bon python=3.11.7
    • Activate environment: micromamba activate bon
    • Install dependencies: pip install -r requirements.txt
    • Install package: pip install -e .
    • Install WavAugment: git clone git@github.com:facebookresearch/WavAugment.git && cd WavAugment && python setup.py develop
  • Prerequisites: Kaldi (install script provided), OpenAI API key, Google API keys, Hugging Face API key (for Llama3/Circuit Breaking), Grayswan API key (for Cygnet), ElevenLabs API key (for ElevenLabs TTS).
  • Setup: Requires creating a SECRETS file with API keys.
  • Documentation: micromamba installation
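Before running experiments, it can help to check that the SECRETS file contains every key a given run needs. The sketch below assumes the file holds one `KEY=VALUE` pair per line; that format and the key names in the usage example are assumptions for illustration, so check the repository README for the exact names it expects. `load_secrets` and `missing_keys` are hypothetical helpers, not part of the package.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def load_secrets(path: str = "SECRETS") -> Dict[str, str]:
    """Parse a KEY=VALUE file, skipping blank lines and '#' comments.

    The KEY=VALUE layout is an assumption about the SECRETS file format.
    """
    secrets: Dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split once so values may contain '='
        secrets[key.strip()] = value.strip()
    return secrets


def missing_keys(secrets: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required key names that are absent or empty."""
    return [name for name in required if not secrets.get(name)]


if __name__ == "__main__":
    # Hypothetical key names; replace with the ones the README specifies.
    required = ["OPENAI_API_KEY", "HF_API_KEY"]
    if Path("SECRETS").exists():
        print("Missing keys:", missing_keys(load_secrets(), required))
    else:
        print("No SECRETS file found in the current directory.")
```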

Highlighted Details

  • Implements Best-of-N jailbreaking for evaluating LLM robustness.
  • Includes a dataset of human-verbalized jailbreaks from HarmBench (308 PAIR, 307 TAP, 159 direct requests).
  • Provides scripts to replicate experiments, including Figure 1.
  • Requires significant API key setup for various LLM and TTS services.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels is provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

The setup process is complex, requiring micromamba, Kaldi, and multiple API keys from various services. The project appears to be a research code release, and its stability or production-readiness is not indicated.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 38 stars in the last 90 days
