Code for jailbreaking LLMs using a Best-of-N approach
This repository provides code for "Best-of-N" (BoN) jailbreaking, a black-box attack technique used to evaluate how robust large language models (LLMs) are to adversarial prompts. It is intended for researchers and practitioners working on LLM security and alignment.
How It Works
The project implements a "Best-of-N" sampling strategy: many augmented variations of a prompt are sampled and sent to the target model, and sampling continues until a response bypasses the model's safeguards or the budget of N attempts is exhausted. The attack success rate as a function of N indicates how well a model withstands repeated jailbreaking attempts.
Quick Start & Requirements
Uses micromamba for environment management.
micromamba env create -n bon python=3.11.7
micromamba activate bon
pip install -r requirements.txt
pip install -e .
git clone git@github.com:facebookresearch/WavAugment.git && cd WavAugment && python setup.py develop
Requires a SECRETS file with API keys.
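The summary does not describe the layout of the SECRETS file. As a minimal sketch, assuming it holds one `KEY=VALUE` pair per line (an assumption, not confirmed by the README), a loader could look like the following. The function name `load_secrets` and the key names in the usage note are hypothetical and are not part of the repository.

```python
from pathlib import Path


def load_secrets(path="SECRETS"):
    """Parse a KEY=VALUE file into a dict.

    The KEY=VALUE layout is an assumption for illustration; check the
    repository for the format it actually expects.
    """
    secrets = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line in secrets file: {raw!r}")
        secrets[key.strip()] = value.strip()
    return secrets
```

With a file containing a line such as `EXAMPLE_API_KEY=abc123` (a placeholder name), `load_secrets()` would return a dictionary mapping that name to its value. Keeping keys in an untracked file rather than in source code reduces the chance of committing them by accident.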
Maintenance & Community
No specific information on contributors, sponsorships, or community channels is provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.
Limitations & Caveats
Setup is complex: it requires micromamba, Kaldi, and API keys for multiple services. The project appears to be a research code release; its stability and production-readiness are not indicated.