Framework for LLM adversarial prompt attack analysis
Top 73.5% on sourcepulse
PromptInject is a framework for quantitatively analyzing the robustness of Large Language Models (LLMs) against adversarial prompt attacks. It targets researchers and developers working with LLMs in production, offering a method to identify and mitigate vulnerabilities like goal hijacking and prompt leaking. The framework provides a prosaic alignment approach for iterative adversarial prompt composition.
How It Works
PromptInject employs a mask-based iterative adversarial prompt composition technique. This method allows for the systematic construction of malicious prompts designed to manipulate LLM behavior. The framework focuses on two primary attack vectors: goal hijacking, where the LLM is steered to produce a specific target string (potentially containing malicious instructions), and prompt leaking, where the LLM is induced to reveal its original system prompt. This approach leverages the stochastic nature of LLMs to create long-tail risks.
Quick Start & Requirements
pip install git+https://github.com/agencyenterprise/PromptInject
notebooks/Example.ipynb
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify the exact license, which could impact commercial use. While focused on GPT-3, the effectiveness and implementation details for other LLMs are not detailed.
1 year ago
Inactive