llama3-jailbreak by haizelabs

Proof-of-concept for jailbreaking Llama 3

created 1 year ago
556 stars

Top 58.5% on sourcepulse

View on GitHub
Project Summary

This repository demonstrates a simple "priming" technique for bypassing the safety guardrails of Meta's Llama 3 large language model. It is intended for researchers and developers interested in LLM safety vulnerabilities and the limitations of current alignment strategies. Its primary value is showing how easily harmful content can be elicited from a model that has undergone extensive safety tuning.

How It Works

The core technique injects a "harmful prefix" at the start of Llama 3's assistant turn, priming the model to continue harmful text it appears to have already written. The prefix is typically generated by another, helpful-only LLM. The project highlights that Llama 3's safety training, while effective in standard dialog flows, can be circumvented once a sufficiently long harmful prefix occupies the assistant turn, overriding the model's learned refusal behavior; the sketch below illustrates the idea.
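As a concrete illustration, here is a minimal Python sketch of the priming idea, assuming Llama 3's published chat-template tokens. This is not the repository's code, and `prefix` stands in for a model-generated harmful prefix.

```python
def primed_prompt(user_message: str, prefix: str) -> str:
    """Open the assistant turn and seed it with `prefix`, so generation
    continues the prefix instead of starting a fresh (refusing) reply."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{prefix}"  # no <|eot_id|>: the assistant turn is left open
    )
```

The key detail is that the assistant turn is never closed with `<|eot_id|>`, so the model treats the prefix as the start of its own reply and simply continues it.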

Quick Start & Requirements

  • Requires Python and access to a Llama 3 model.
  • The llama3_tokenizer.py script demonstrates the required modification to the dialog prompt encoding.
  • Specific commands are not provided; the core idea is simply to modify the prompt structure before inference (see the tokenizer-level sketch after this list).
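Assuming Meta's reference llama3 tokenizer interfaces (`ChatFormat.encode_dialog_prompt`, `Tokenizer.encode`), the modification might look like the following sketch; treat it as an illustration of the concept, not the exact contents of llama3_tokenizer.py.

```python
def encode_primed_dialog(chat_format, dialog, prefix: str) -> list[int]:
    """Encode the dialog as usual (it ends just after the assistant
    header), then append the prefix tokens without an end-of-turn
    marker so the model must continue the prefix."""
    tokens = chat_format.encode_dialog_prompt(dialog)
    tokens += chat_format.tokenizer.encode(prefix, bos=False, eos=False)
    return tokens
```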

Highlighted Details

  • Demonstrates a trivial programmatic jailbreak against Llama 3.
  • Shows that prefix length strongly affects the jailbreak's success rate, with a 100-token prefix achieving a 98% attack success rate (see the sweep sketch after this list).
  • Suggests LLMs lack self-reflection: once harmful text appears in the assistant's own turn, the model continues it rather than catching itself.
  • Mentions contemporaneous work on similar priming jailbreaks by Ben Lemkin and Jason Vega.
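The prefix-length result could be measured with a harness along these lines; everything here (`build_prompt`, `generate`, `is_harmful`) is a hypothetical placeholder supplied by the caller, not code from the repository.

```python
def attack_success_rate(cases, build_prompt, generate, is_harmful, n_tokens):
    """cases: (user_message, harmful_prefix) pairs. Returns the fraction
    of primed generations judged harmful at this prefix length."""
    hits = 0
    for user_message, prefix in cases:
        truncated = " ".join(prefix.split()[:n_tokens])  # crude word-level cut
        completion = generate(build_prompt(user_message, truncated))
        hits += bool(is_harmful(completion))
    return hits / len(cases)
```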

Maintenance & Community

  • The project is authored by Leonard Tang.
  • Contact is available via email at contact@haizelabs.com for collaboration or discussion.
  • A BibTeX entry is provided for citation.

Licensing & Compatibility

  • The repository does not explicitly state a license. The provided BibTeX entry points to a Zenodo DOI, a platform that typically hosts research artifacts.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project focuses solely on a specific "priming" attack vector against Llama 3 and may not generalize to other models or attack types. The effectiveness is highly dependent on the specific Llama 3 version and its exact safety tuning.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
11 stars in the last 90 days

Explore Similar Projects

llm-security by greshake (Top 0.2% on sourcepulse, 2k stars)

  • Research paper on indirect prompt injection attacks targeting app-integrated LLMs
  • Created 2 years ago, updated 2 weeks ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more

PurpleLlama by meta-llama (Top 0.5% on sourcepulse, 4k stars)

  • LLM security toolkit for assessing/improving generative AI models
  • Created 1 year ago, updated 1 week ago
  • Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more