llama3-jailbreak by haizelabs

Proof-of-concept for jailbreaking Llama 3

created 1 year ago
556 stars

Top 58.5% on sourcepulse

View on GitHub
Project Summary

This repository demonstrates a simple "priming" technique for bypassing the safety guardrails of Meta's Llama 3 large language model. It is intended for researchers and developers interested in LLM safety vulnerabilities and the limitations of current alignment strategies. Its primary value is showing how easily harmful content can be elicited from a model that has undergone extensive safety tuning.

How It Works

The core technique injects a "harmful prefix" at the start of Llama 3's assistant turn, priming the model to continue harmful text it appears to have already written. The prefix is typically generated by another, helpful-only LLM. The project highlights that Llama 3's safety training, while effective in standard dialog flows, can be circumvented once a sufficiently long harmful prefix occupies the assistant turn, overriding the model's learned refusal behavior; the sketch below illustrates the idea.
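As a concrete illustration, here is a minimal Python sketch of the priming idea, assuming Llama 3's published chat-template tokens. This is not the repository's code, and `prefix` stands in for a model-generated harmful prefix.

```python
def primed_prompt(user_message: str, prefix: str) -> str:
    """Open the assistant turn and seed it with `prefix`, so generation
    continues the prefix instead of starting a fresh (refusing) reply."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{prefix}"  # no <|eot_id|>: the assistant turn is left open
    )
```

The key detail is that the assistant turn is never closed with `<|eot_id|>`, so the model treats the prefix as the start of its own reply and simply continues it.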

Quick Start & Requirements

  • Requires Python and access to a Llama 3 model.
  • The llama3_tokenizer.py script demonstrates the required modification to the dialog prompt encoding.
  • Specific commands are not provided; the core idea is simply to modify the prompt structure before inference (see the tokenizer-level sketch after this list).
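Assuming Meta's reference llama3 tokenizer interfaces (`ChatFormat.encode_dialog_prompt`, `Tokenizer.encode`), the modification might look like the following sketch; treat it as an illustration of the concept, not the exact contents of llama3_tokenizer.py.

```python
def encode_primed_dialog(chat_format, dialog, prefix: str) -> list[int]:
    """Encode the dialog as usual (it ends just after the assistant
    header), then append the prefix tokens without an end-of-turn
    marker so the model must continue the prefix."""
    tokens = chat_format.encode_dialog_prompt(dialog)
    tokens += chat_format.tokenizer.encode(prefix, bos=False, eos=False)
    return tokens
```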

Highlighted Details

  • Demonstrates a trivial programmatic jailbreak against Llama 3.
  • Shows that prefix length strongly affects the jailbreak's success rate, with a 100-token prefix achieving a 98% attack success rate (see the sweep sketch after this list).
  • Suggests LLMs lack self-reflection: once harmful text appears in the assistant's own turn, the model continues it rather than catching itself.
  • Mentions contemporaneous work on similar priming jailbreaks by Ben Lemkin and Jason Vega.
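The prefix-length result could be measured with a harness along these lines; everything here (`build_prompt`, `generate`, `is_harmful`) is a hypothetical placeholder supplied by the caller, not code from the repository.

```python
def attack_success_rate(cases, build_prompt, generate, is_harmful, n_tokens):
    """cases: (user_message, harmful_prefix) pairs. Returns the fraction
    of primed generations judged harmful at this prefix length."""
    hits = 0
    for user_message, prefix in cases:
        truncated = " ".join(prefix.split()[:n_tokens])  # crude word-level cut
        completion = generate(build_prompt(user_message, truncated))
        hits += bool(is_harmful(completion))
    return hits / len(cases)
```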

Maintenance & Community

  • The project is authored by Leonard Tang.
  • Contact is available via email at contact@haizelabs.com for collaboration or discussion.
  • A BibTeX entry is provided for citation.

Licensing & Compatibility

  • The repository does not explicitly state a license. The provided BibTeX entry points to a Zenodo DOI, a platform that typically hosts research artifacts.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project focuses solely on a specific "priming" attack vector against Llama 3 and may not generalize to other models or attack types. The effectiveness is highly dependent on the specific Llama 3 version and its exact safety tuning.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
11 stars in the last 90 days

Explore Similar Projects

llm-security by greshake (Top 0.2% on sourcepulse, 2k stars)

  • Research paper on indirect prompt injection attacks targeting app-integrated LLMs
  • Created 2 years ago, updated 2 weeks ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more

PurpleLlama by meta-llama (Top 0.5% on sourcepulse, 4k stars)

  • LLM security toolkit for assessing/improving generative AI models
  • Created 1 year ago, updated 1 week ago
  • Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more