Visual-Adversarial-Examples-Jailbreak-Large-Language-Models by Unispac

Visual adversarial examples bypass LLM safety alignments

Created 2 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

Summary

This repository addresses the vulnerability of aligned large language models (LLMs) to visual adversarial attacks. It provides code and examples demonstrating how carefully crafted visual inputs can "jailbreak" multimodal LLMs, causing them to generate harmful or offensive content. Aimed at researchers and security professionals, this project offers a novel attack vector to probe and understand the safety limitations of current AI systems.

How It Works

The project generates visual adversarial examples by optimizing an image to maximize the LLM's likelihood of producing undesirable outputs, using a small, curated corpus of derogatory content as the optimization target. This bypasses alignment mechanisms: the attacked model then generates harmful content well beyond the specific corpus used during optimization. Feeding the resulting adversarial image to a multimodal LLM substantially weakens its safety alignment.
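For illustration, here is a minimal PGD-style sketch of this kind of optimization under an L∞ budget. It assumes a hypothetical `model.target_nll(image, text)` that returns the differentiable negative log-likelihood of a target sentence given an image; it is not the repository's actual implementation, whose scripts and hyperparameters may differ.

```python
import torch

# Hypothetical sketch of the general optimization loop (PGD with an L-infinity budget),
# not the repository's exact code. `model.target_nll` is an assumed interface that
# returns the negative log-likelihood of a target text conditioned on an image.
def craft_adversarial_image(model, clean_image, target_texts,
                            epsilon=16 / 255, step_size=1 / 255, num_steps=500):
    adv_image = clean_image.clone().detach()

    for _ in range(num_steps):
        adv_image.requires_grad_(True)
        # Maximize the likelihood of the harmful corpus, i.e. minimize its NLL.
        loss = torch.stack(
            [model.target_nll(adv_image, text) for text in target_texts]
        ).mean()
        grad = torch.autograd.grad(loss, adv_image)[0]

        with torch.no_grad():
            # Signed gradient step downhill on the NLL pushes the image toward
            # outputs the aligned model would normally refuse.
            adv_image = adv_image - step_size * grad.sign()
            # Project back into the epsilon-ball around the clean image
            # and the valid pixel range.
            adv_image = torch.clamp(adv_image,
                                    clean_image - epsilon,
                                    clean_image + epsilon)
            adv_image = torch.clamp(adv_image, 0.0, 1.0).detach()

    return adv_image
```

The ε-ball projection corresponds to the distortion budgets noted under Highlighted Details (e.g., ε = 16/255); a larger or removed budget trades visual stealth for attack strength.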

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git), then create and activate the Conda environment from environment.yml.
  • Prerequisites: Requires pretrained weights for MiniGPT-4 (Vicuna-13B v0 and the MiniGPT-4 checkpoint), with paths configurable in YAML files (see the path-check sketch after the links below). A single A100 80G GPU is sufficient.
  • Links:
    • Repository: https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
    • MiniGPT-4 Huggingface Space: https://huggingface.co/spaces/Vision-CAIR/minigpt4
    • Paper arXiv: https://arxiv.org/abs/2306.13213
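Because the weight locations live in YAML config files, a small helper like the following can verify that every configured path exists before launching a run. This is a hypothetical utility, not part of the repository; the example filename and the path heuristic are assumptions.

```python
import pathlib
import yaml  # PyYAML

# Hypothetical helper (not part of the repository): check that every string in a
# config that looks like a local path actually exists on disk.
def check_config_paths(config_file):
    cfg = yaml.safe_load(pathlib.Path(config_file).read_text())

    def walk(node):
        # Recursively yield strings that look like paths (rough heuristic:
        # contains "/" or ends in ".pth"); this will also surface URLs, which
        # is acceptable for a quick sanity check.
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for value in node:
                yield from walk(value)
        elif isinstance(node, str) and ("/" in node or node.endswith(".pth")):
            yield node

    for candidate in walk(cfg):
        path = pathlib.Path(candidate).expanduser()
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {candidate}")

# Example (illustrative filename):
# check_config_paths("eval_configs/minigpt4_eval.yaml")
```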

Highlighted Details

  • Demonstrates that a single visual adversarial example can jailbreak aligned multimodal LLMs like MiniGPT-4.
  • Attack methodologies are implemented for MiniGPT-4, InstructBLIP, and LLaVA.
  • Adversarial examples can be generated under various distortion constraints (e.g., ε = 16/255).
  • Evaluation uses the RealToxicityPrompts dataset with toxicity scoring via the Perspective API and Detoxify (see the scoring sketch after this list).
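As an illustration of the Detoxify side of that evaluation pipeline, the sketch below scores a batch of model generations. The texts and variable names are placeholders; scoring with the Perspective API would additionally require an API key and is not shown.

```python
from detoxify import Detoxify  # pip install detoxify

# Placeholder outputs standing in for text produced by the attacked VLM
# when continuing RealToxicityPrompts prompts.
generations = [
    "placeholder model output 1",
    "placeholder model output 2",
]

scorer = Detoxify("original")          # downloads the pretrained classifier
scores = scorer.predict(generations)   # dict mapping attribute -> list of scores

for i, text in enumerate(generations):
    print(f"sample {i}: toxicity={scores['toxicity'][i]:.3f}")
```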

Maintenance & Community

The project originates from researchers at Princeton and Stanford and accompanies the associated AAAI 2024 (Oral) paper. No specific community channels or roadmap details are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Users should exercise caution regarding usage rights and potential restrictions, especially for commercial applications.

Limitations & Caveats

This repository contains offensive content and demonstrates harmful model behaviors. Attack effectiveness is shown on specific VLM architectures (MiniGPT-4, InstructBLIP, LLaVA) and may not generalize universally. Setup requires obtaining and configuring external model weights.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
