llm-attacks by llm-attacks

Attack framework for aligned LLMs, based on the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models"

created 2 years ago
4,076 stars

Top 12.3% on sourcepulse

Project Summary

This repository provides tools and code for universal and transferable adversarial attacks on aligned language models, specifically targeting jailbreaking. It is intended for researchers and practitioners in AI safety and security who need to understand and defend against such attacks. The primary benefit is enabling the evaluation and mitigation of vulnerabilities in large language models.

How It Works

The project implements Greedy Coordinate Gradient (GCG), a gradient-guided discrete optimization method. GCG searches for an adversarial suffix that, when appended to a prompt, causes the language model to produce harmful or otherwise undesirable outputs even after alignment training. The approach is notable because the resulting suffixes transfer across different models, and a single suffix can be optimized against multiple harmful behaviors at once.
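
As a rough illustration, here is a minimal PyTorch sketch of one GCG step: compute the gradient of the target-completion loss with respect to a one-hot encoding of the suffix, take the top-k candidate token swaps per position, then evaluate random swaps exactly and keep the best. All names and signatures are illustrative, not the repository's API; a real implementation batches the candidate evaluations and freezes the model weights.

```python
# Conceptual sketch of a single Greedy Coordinate Gradient (GCG) step.
# Assumes a HuggingFace-style causal LM; names are illustrative and do not
# match the llm-attacks codebase. In practice, freeze model parameters and
# batch the candidate evaluations instead of looping.
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             top_k=256, num_trials=64):
    # One-hot encode the suffix so we can differentiate w.r.t. token choice.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)

    # Build input embeddings: prompt + adversarial suffix + forced target.
    embeds = torch.cat([embed_matrix[prompt_ids],
                        one_hot @ embed_matrix,
                        embed_matrix[target_ids]]).unsqueeze(0)
    logits = model(inputs_embeds=embeds).logits[0]

    # Cross-entropy on the target tokens only (the logit at position i
    # predicts token i+1, hence the -1 offset).
    t0 = prompt_ids.numel() + suffix_ids.numel()
    loss = F.cross_entropy(logits[t0 - 1 : t0 - 1 + target_ids.numel()],
                           target_ids)
    loss.backward()

    # The gradient w.r.t. the one-hot indicates, per position, which token
    # swaps are expected to lower the loss; keep the top-k per position.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices

    # Evaluate random single-token swaps exactly and keep the best one.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(num_trials):
        pos = torch.randint(suffix_ids.numel(), (1,)).item()
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            ids = torch.cat([prompt_ids, trial, target_ids]).unsqueeze(0)
            out = model(ids).logits[0]
            trial_loss = F.cross_entropy(
                out[t0 - 1 : t0 - 1 + target_ids.numel()], target_ids).item()
        if trial_loss < best_loss:
            best_loss, best_suffix = trial_loss, trial
    return best_suffix, best_loss
```

Iterating this step greedily, one token swap at a time, yields the adversarial suffix; the discrete top-k candidate set is what lets a gradient method operate over tokens rather than continuous embeddings.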

Quick Start & Requirements

  • Install via pip install -e . after cloning the repository.
  • Requires fschat==0.2.23.
  • Models (Vicuna-7B or LLaMA-2-7B-Chat) must be downloaded and their paths configured in experiments/configs/ (a loading sanity check follows this list).
  • For the demo: pip install livelossplot.
  • Official demo notebook: demo.ipynb or on Colab.
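
A quick way to confirm the downloaded weights load correctly, using standard transformers calls (not part of this repo; the local path is hypothetical and should match what you set in experiments/configs/):

```python
# Sanity check: load the target model with HuggingFace transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/vicuna-7b-v1.3"  # hypothetical; use your local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",  # requires the `accelerate` package
)
print(model.config.model_type)
```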

Highlighted Details

  • Implements the attacks from "Universal and Transferable Adversarial Attacks on Aligned Language Models."
  • Includes code to reproduce the paper's GCG experiments on AdvBench for individual, multiple-behavior, and transfer attacks (see the aggregation sketch after this list).
  • Supports LLaMA- or Pythia-based models only; other models may fail with silent errors.
  • Requires significant GPU resources (experiments were run on NVIDIA A100 80GB GPUs).
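
For the multiple-behavior and transfer settings, the objective is simply the single-behavior loss summed over several (prompt, target) pairs and, for transfer, several models. A hedged sketch, where suffix_loss is assumed to compute the target cross-entropy exactly as in the GCG step above:

```python
# Sketch of the universal objective: one suffix scored against several
# behaviors and (for transfer) several models by summing per-behavior losses.
# suffix_loss() is an assumed helper matching the gcg_step sketch above;
# behaviors is a list of (prompt_ids, target_ids) pairs.
def universal_loss(models, behaviors, suffix_ids, suffix_loss):
    total = 0.0
    for model in models:                          # several models -> transfer
        for prompt_ids, target_ids in behaviors:  # several harmful behaviors
            total += suffix_loss(model, prompt_ids, suffix_ids, target_ids)
    return total
```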

Maintenance & Community

  • The project is associated with the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models."
  • A new implementation, nanogcg, is available separately for faster GCG execution.

Licensing & Compatibility

  • Licensed under the MIT license.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The codebase currently supports training only with LLaMA- or Pythia-based models; using other models may lead to silent errors. Experiments were conducted on NVIDIA A100 80GB GPUs, so reproducing them requires substantial hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 213 stars in the last 90 days
