llm-security by dropbox

LLM security research code for prompt injection

Created 2 years ago

255 stars

Top 98.7% on SourcePulse

Project Summary

Summary

This repository from Dropbox offers research code and findings on sophisticated prompt injection attacks against Large Language Models (LLMs), particularly focusing on OpenAI's ChatGPT models. It details novel techniques employing repeated tokens and character sequences to bypass model instructions, induce unintended behaviors like hallucinations, and extract sensitive memorized training data. Intended for security researchers and developers, this project provides practical demonstrations to understand and build defenses against these emerging LLM vulnerabilities.

How It Works

The core methodology leverages the insertion of carefully crafted repeated sequences of tokens or characters within LLM prompts. These repetitions, especially multi-token sequences, exploit vulnerabilities to circumvent prompt template instructions and content constraints. This can lead to significant model destabilization, manifesting as jailbreaks, irrelevant responses (hallucinations), or, critically, the leakage of proprietary training data via divergence attacks. The research systematically quantifies the "blackout" effect, measuring how many repetitions are needed for a model to ignore earlier prompt instructions.

Quick Start & Requirements

To utilize this repository, clone it and navigate to the src directory. A Python 3 environment is necessary. Users must configure their OpenAI API key by setting the OPENAI_API_KEY environment variable. The provided demonstration scripts (repeated-tokens.py, question-with-context.py, repeated-sequences.py) can then be executed, allowing selection of specific OpenAI models (e.g., gpt-3.5-turbo, gpt-4) and operational modes.

Highlighted Details

Multi-token sequence repetitions have proven effective for prompt injection and data extraction, bypassing OpenAI's filters that previously mitigated single-token repetition attacks.
The repeated-tokens.py script provides concrete examples of triggering divergence attacks to exfiltrate memorized training data from models like GPT-3.5 and GPT-4.
repeated-sequences.py offers quantitative analysis of the "blackout" effect, detailing how various control and space-character sequences impact LLM context retention and instruction following.
The research suggests that statistical analysis of character frequencies, rather than solely relying on specific sequence detection, may represent a more robust strategy for identifying and preventing these types of prompt injection attacks.

Maintenance & Community

This repository appears to be a static research artifact, with no explicit mention of ongoing maintenance, community forums (like Discord or Slack), or a public roadmap. It acknowledges contributions from internal and external Dropbox collaborators.

Licensing & Compatibility

The project is licensed under the permissive Apache License, Version 2.0. This license permits broad use, modification, and distribution, including for commercial purposes, provided the terms of the license are adhered to.

Limitations & Caveats

The effectiveness of the demonstrated attacks is contingent upon the current state of OpenAI's filtering mechanisms, which have evolved and may continue to change, potentially rendering some techniques obsolete. The success rate can also vary significantly based on the specific LLM version, the complexity of the prompt, and the precise nature of the repeated sequences employed.

llm-security by dropbox

Explore Similar Projects

aegis by automorphic-ai

prompt-hacker-collections by yunwei37

camel-prompt-injection by google-research

HouYi by LLMSecurity

awesome-prompt-injection by Joe-B-Security

zeroleaks by ZeroLeaks

ps-fuzz by prompt-security

awesome-gpt-security by cckuailong

rebuff by protectai

vulnhuntr by protectai

llm-guard by protectai

PurpleLlama by meta-llama