Proof-of-concept for jailbreaking Llama 3
Top 58.5% on sourcepulse
This repository demonstrates a simple "priming" technique for bypassing the safety guardrails of Meta's Llama 3 large language model. It is intended for researchers and developers interested in understanding LLM safety vulnerabilities and the limitations of current alignment strategies. Its primary value is showing how a straightforward method can elicit harmful content from a model that is meant to be safe.
How It Works
The core technique prepends a "harmful prefix" to the assistant turn of the prompt sent to Llama 3. This prefix, often generated by another, helpful-only LLM, seeds the assistant role with the opening of a harmful answer so that Llama 3 continues it rather than refusing. The project highlights that Llama 3's safety training, while effective in standard dialog flows, can be circumvented when the model is guided by a sufficiently long harmful prefix that overrides its learned refusal behavior.
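The sketch below illustrates the priming idea at the prompt-string level. It assumes the standard Llama 3 Instruct chat template special tokens; build_primed_prompt and its arguments are illustrative names, not code from this repository.

```python
def build_primed_prompt(user_request: str, harmful_prefix: str) -> str:
    """Build a Llama 3 Instruct prompt whose assistant turn is pre-filled.

    The assistant header is opened and seeded with `harmful_prefix`, and no
    <|eot_id|> is appended, so the model continues the prefix instead of
    starting a fresh (and likely refusing) response.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_request}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{harmful_prefix}"  # generation picks up from here
    )
```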
Quick Start & Requirements
The llama3_tokenizer.py script demonstrates the modification to the dialog prompt encoding.
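As a rough illustration of what that modification might look like, the sketch below assumes Meta's reference Tokenizer/ChatFormat interface from the llama3 repository; encode_primed_dialog_prompt is a hypothetical helper, not necessarily the actual code in llama3_tokenizer.py.

```python
def encode_primed_dialog_prompt(chat_format, dialog, harmful_prefix):
    """Encode a dialog, then seed the open assistant turn with a harmful prefix.

    `chat_format` is assumed to behave like llama3's ChatFormat: its
    encode_dialog_prompt() returns tokens ending with an empty assistant header.
    """
    tokens = chat_format.encode_dialog_prompt(dialog)
    # Append the prefix tokens without <|eot_id|>, so generation continues
    # from the prefix rather than starting a new (refusing) assistant message.
    tokens.extend(chat_format.tokenizer.encode(harmful_prefix, bos=False, eos=False))
    return tokens
```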
Highlighted Details
Maintenance & Community
The last recorded update was 6 months ago, and the repository is marked inactive.
Licensing & Compatibility
Limitations & Caveats
The project focuses solely on a specific "priming" attack vector against Llama 3 and may not generalize to other models or attack types. The effectiveness is highly dependent on the specific Llama 3 version and its exact safety tuning.