Visual-Adversarial-Examples-Jailbreak-Large-Language-Models by Unispac

Visual adversarial examples bypass LLM safety alignments

Created 2 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

Summary

This repository addresses the vulnerability of aligned large language models (LLMs) to visual adversarial attacks. It provides code and examples demonstrating how carefully crafted visual inputs can "jailbreak" multimodal LLMs, causing them to generate harmful or offensive content. Aimed at researchers and security professionals, this project offers a novel attack vector to probe and understand the safety limitations of current AI systems.

How It Works

The project generates visual adversarial examples by optimizing an image to maximize the LLM's likelihood of producing undesirable outputs, using a small, curated corpus of derogatory content as the optimization target. This bypasses alignment mechanisms: the attacked model then generates harmful content well beyond the specific corpus used during optimization. Feeding the resulting adversarial image to a multimodal LLM substantially weakens its safety alignment.
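For illustration, here is a minimal PGD-style sketch of this kind of optimization under an L∞ budget. It assumes a hypothetical `model.target_nll(image, text)` that returns the differentiable negative log-likelihood of a target sentence given an image; it is not the repository's actual implementation, whose scripts and hyperparameters may differ.

```python
import torch

# Hypothetical sketch of the general optimization loop (PGD with an L-infinity budget),
# not the repository's exact code. `model.target_nll` is an assumed interface that
# returns the negative log-likelihood of a target text conditioned on an image.
def craft_adversarial_image(model, clean_image, target_texts,
                            epsilon=16 / 255, step_size=1 / 255, num_steps=500):
    adv_image = clean_image.clone().detach()

    for _ in range(num_steps):
        adv_image.requires_grad_(True)
        # Maximize the likelihood of the harmful corpus, i.e. minimize its NLL.
        loss = torch.stack(
            [model.target_nll(adv_image, text) for text in target_texts]
        ).mean()
        grad = torch.autograd.grad(loss, adv_image)[0]

        with torch.no_grad():
            # Signed gradient step downhill on the NLL pushes the image toward
            # outputs the aligned model would normally refuse.
            adv_image = adv_image - step_size * grad.sign()
            # Project back into the epsilon-ball around the clean image
            # and the valid pixel range.
            adv_image = torch.clamp(adv_image,
                                    clean_image - epsilon,
                                    clean_image + epsilon)
            adv_image = torch.clamp(adv_image, 0.0, 1.0).detach()

    return adv_image
```

The ε-ball projection corresponds to the distortion budgets noted under Highlighted Details (e.g., ε = 16/255); a larger or removed budget trades visual stealth for attack strength.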

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git), then create and activate the Conda environment from environment.yml.
  • Prerequisites: Requires pretrained weights for MiniGPT-4 (Vicuna-13B v0 and the MiniGPT-4 checkpoint), with paths configurable in YAML files (see the path-check sketch after the links below). A single A100 80G GPU is sufficient.
  • Links:
    • Repository: https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
    • MiniGPT-4 Huggingface Space: https://huggingface.co/spaces/Vision-CAIR/minigpt4
    • Paper arXiv: https://arxiv.org/abs/2306.13213
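Because the weight locations live in YAML config files, a small helper like the following can verify that every configured path exists before launching a run. This is a hypothetical utility, not part of the repository; the example filename and the path heuristic are assumptions.

```python
import pathlib
import yaml  # PyYAML

# Hypothetical helper (not part of the repository): check that every string in a
# config that looks like a local path actually exists on disk.
def check_config_paths(config_file):
    cfg = yaml.safe_load(pathlib.Path(config_file).read_text())

    def walk(node):
        # Recursively yield strings that look like paths (rough heuristic:
        # contains "/" or ends in ".pth"); this will also surface URLs, which
        # is acceptable for a quick sanity check.
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for value in node:
                yield from walk(value)
        elif isinstance(node, str) and ("/" in node or node.endswith(".pth")):
            yield node

    for candidate in walk(cfg):
        path = pathlib.Path(candidate).expanduser()
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {candidate}")

# Example (illustrative filename):
# check_config_paths("eval_configs/minigpt4_eval.yaml")
```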

Highlighted Details

  • Demonstrates that a single visual adversarial example can jailbreak aligned multimodal LLMs like MiniGPT-4.
  • Attack methodologies are implemented for MiniGPT-4, InstructBLIP, and LLaVA.
  • Adversarial examples can be generated under various distortion constraints (e.g., ε = 16/255).
  • Evaluation uses the RealToxicityPrompts dataset with toxicity scoring via the Perspective API and Detoxify (see the scoring sketch after this list).
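As an illustration of the Detoxify side of that evaluation pipeline, the sketch below scores a batch of model generations. The texts and variable names are placeholders; scoring with the Perspective API would additionally require an API key and is not shown.

```python
from detoxify import Detoxify  # pip install detoxify

# Placeholder outputs standing in for text produced by the attacked VLM
# when continuing RealToxicityPrompts prompts.
generations = [
    "placeholder model output 1",
    "placeholder model output 2",
]

scorer = Detoxify("original")          # downloads the pretrained classifier
scores = scorer.predict(generations)   # dict mapping attribute -> list of scores

for i, text in enumerate(generations):
    print(f"sample {i}: toxicity={scores['toxicity'][i]:.3f}")
```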

Maintenance & Community

The project originates from researchers at Princeton and Stanford and accompanies the associated AAAI 2024 (Oral) paper. No specific community channels or roadmap details are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Users should exercise caution regarding usage rights and potential restrictions, especially for commercial applications.

Limitations & Caveats

This repository contains offensive content and demonstrates harmful model behaviors. Attack effectiveness is shown on specific VLM architectures (MiniGPT-4, InstructBLIP, LLaVA) and may not generalize universally. Setup requires obtaining and configuring external model weights.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
