Discover and explore top open-source AI tools and projects—updated daily.
dropboxLLM security research code for prompt injection
Top 99.8% on SourcePulse
Summary
This repository from Dropbox offers research code and findings on sophisticated prompt injection attacks against Large Language Models (LLMs), particularly focusing on OpenAI's ChatGPT models. It details novel techniques employing repeated tokens and character sequences to bypass model instructions, induce unintended behaviors like hallucinations, and extract sensitive memorized training data. Intended for security researchers and developers, this project provides practical demonstrations to understand and build defenses against these emerging LLM vulnerabilities.
How It Works
The core methodology leverages the insertion of carefully crafted repeated sequences of tokens or characters within LLM prompts. These repetitions, especially multi-token sequences, exploit vulnerabilities to circumvent prompt template instructions and content constraints. This can lead to significant model destabilization, manifesting as jailbreaks, irrelevant responses (hallucinations), or, critically, the leakage of proprietary training data via divergence attacks. The research systematically quantifies the "blackout" effect, measuring how many repetitions are needed for a model to ignore earlier prompt instructions.
Quick Start & Requirements
To utilize this repository, clone it and navigate to the src directory. A Python 3 environment is necessary. Users must configure their OpenAI API key by setting the OPENAI_API_KEY environment variable. The provided demonstration scripts (repeated-tokens.py, question-with-context.py, repeated-sequences.py) can then be executed, allowing selection of specific OpenAI models (e.g., gpt-3.5-turbo, gpt-4) and operational modes.
Highlighted Details
repeated-tokens.py script provides concrete examples of triggering divergence attacks to exfiltrate memorized training data from models like GPT-3.5 and GPT-4.repeated-sequences.py offers quantitative analysis of the "blackout" effect, detailing how various control and space-character sequences impact LLM context retention and instruction following.Maintenance & Community
This repository appears to be a static research artifact, with no explicit mention of ongoing maintenance, community forums (like Discord or Slack), or a public roadmap. It acknowledges contributions from internal and external Dropbox collaborators.
Licensing & Compatibility
The project is licensed under the permissive Apache License, Version 2.0. This license permits broad use, modification, and distribution, including for commercial purposes, provided the terms of the license are adhered to.
Limitations & Caveats
The effectiveness of the demonstrated attacks is contingent upon the current state of OpenAI's filtering mechanisms, which have evolved and may continue to change, potentially rendering some techniques obsolete. The success rate can also vary significantly based on the specific LLM version, the complexity of the prompt, and the precise nature of the repeated sequences employed.
1 year ago
Inactive
protectai
protectai
meta-llama