Unispac: Visual adversarial examples bypass LLM safety alignments
Summary
This repository addresses the vulnerability of aligned large language models (LLMs) to visual adversarial attacks. It provides code and examples demonstrating how carefully crafted visual inputs can "jailbreak" multimodal LLMs, causing them to generate harmful or offensive content. Aimed at researchers and security professionals, this project offers a novel attack vector to probe and understand the safety limitations of current AI systems.
How It Works
The project generates visual adversarial examples by optimizing an image to maximize the probability that an aligned LLM produces undesirable outputs, using a small, curated corpus of derogatory content as the optimization target. Although optimized against that narrow corpus, the resulting image generalizes: it bypasses alignment mechanisms and elicits harmful content well beyond the specific optimization data. Fed as input to a multimodal LLM, such an adversarial image substantially undermines the model's safety alignment.
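The toy sketch below illustrates the general recipe only: projected gradient descent on image pixels to raise a model's likelihood of target text. The tiny surrogate model, toy vocabulary, target token ids, and PGD hyperparameters are illustrative assumptions, not the repository's actual code or settings, which attack a real vision-language model such as MiniGPT-4.

```python
# Minimal sketch of the attack idea, assuming a toy surrogate model in place of
# a real vision-language model (e.g., MiniGPT-4). Not the repository's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB = 1000                       # toy vocabulary size (assumption)
IMG_SHAPE = (3, 32, 32)            # toy image resolution (assumption)

# Surrogate: maps an image to next-token logits. A real attack would instead
# backpropagate the frozen multimodal LLM's loss on harmful target sentences.
surrogate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, VOCAB))

# Stand-in "corpus": token ids the attacker wants the model to emit.
target_tokens = torch.tensor([7, 42, 99])

image = torch.rand(1, *IMG_SHAPE)                    # benign starting image in [0, 1]
delta = torch.zeros_like(image, requires_grad=True)  # adversarial perturbation
epsilon, alpha, steps = 16 / 255, 1 / 255, 200       # PGD budget and step size (assumption)
loss_fn = nn.CrossEntropyLoss()

for _ in range(steps):
    logits = surrogate(image + delta)                # shape (1, VOCAB)
    # Cross-entropy toward each target token; minimizing it raises their likelihood.
    loss = torch.stack([loss_fn(logits, t.view(1)) for t in target_tokens]).mean()
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()           # signed gradient step
        delta.clamp_(-epsilon, epsilon)              # stay within the L-inf budget
        delta.copy_(torch.clamp(image + delta, 0, 1) - image)  # keep pixels valid
    delta.grad.zero_()

adv_image = (image + delta).detach()                 # candidate adversarial image
```

In the actual attack described in the paper, an equivalent loop runs against the frozen weights of the target multimodal model, and the optimized image is then paired with ordinary text prompts at inference time.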
Quick Start & Requirements
Clone the repository (git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git), then create and activate a Conda environment from environment.yml.
Repository: https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
MiniGPT-4 demo: https://huggingface.co/spaces/Vision-CAIR/minigpt4
Paper: https://arxiv.org/abs/2306.13213
Highlighted Details
Maintenance & Community
The project originates from researchers at Princeton and Stanford Universities, associated with the AAAI 2024 (Oral) paper. No specific community channels or roadmap details are provided in the README.
Licensing & Compatibility
The README does not specify a software license. Users should exercise caution regarding usage rights and potential restrictions, especially for commercial applications.
Limitations & Caveats
This repository contains offensive content and model behaviors. The effectiveness of the attacks is demonstrated on specific VLM architectures and may not generalize universally. Setup requires obtaining and configuring external model weights.
Last updated: 1 year ago (inactive)
Similar projects: llm-attacks, cleverhans-lab