llm-attacks by llm-attacks

Attack framework for aligned LLMs, based on a research paper

Created 2 years ago
4,212 stars

Top 11.7% on SourcePulse

Project Summary

This repository provides tools and code for universal and transferable adversarial attacks on aligned language models, with a focus on jailbreaking. It is intended for researchers and practitioners in AI safety and security who need to understand and defend against such attacks, and its primary benefit is enabling the evaluation and mitigation of these vulnerabilities in large language models.

How It Works

The project implements the Greedy Coordinate Gradient (GCG) algorithm, a gradient-guided discrete optimization method. GCG searches for adversarial suffixes that, when appended to a prompt, cause the language model to produce harmful or undesirable outputs even after alignment training. The approach is notable because a single optimized suffix can transfer across different models and can elicit multiple harmful behaviors at once.
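
To make the optimization loop concrete, here is a minimal, self-contained Python sketch of one GCG step. The tiny embedding-plus-linear "model", the loss function, and every name in it (suffix_ids, target_ids, top_k) are illustrative stand-ins chosen for this sketch only, not the repository's actual API; the real attack runs the same kind of loop against a full chat model such as Vicuna or LLaMA-2-Chat.

  # Minimal sketch of a single Greedy Coordinate Gradient (GCG) step.
  # The toy model below is a stand-in for a real LLM, used only to
  # illustrate the gradient -> candidate -> greedy-swap structure.
  import torch
  import torch.nn.functional as F

  torch.manual_seed(0)
  vocab_size, dim, suffix_len, top_k = 100, 32, 8, 8

  # Toy "language model": an embedding layer followed by an output head.
  embed = torch.nn.Embedding(vocab_size, dim)
  lm_head = torch.nn.Linear(dim, vocab_size)

  def target_loss(onehot_suffix, target_ids):
      # Cross-entropy of the toy model's predictions against the target
      # tokens, computed from a differentiable one-hot suffix encoding.
      hidden = onehot_suffix @ embed.weight        # (suffix_len, dim)
      logits = lm_head(hidden)                     # (suffix_len, vocab_size)
      return F.cross_entropy(logits, target_ids)

  suffix_ids = torch.randint(0, vocab_size, (suffix_len,))  # current adversarial suffix
  target_ids = torch.randint(0, vocab_size, (suffix_len,))  # tokens the attack wants forced

  # 1. Gradient of the loss w.r.t. a one-hot encoding of the current suffix.
  onehot = F.one_hot(suffix_ids, vocab_size).float().requires_grad_(True)
  target_loss(onehot, target_ids).backward()

  # 2. For each suffix position, keep the top-k token substitutions with the
  #    most negative gradient, i.e. those predicted to reduce the loss most.
  candidates = (-onehot.grad).topk(top_k, dim=-1).indices   # (suffix_len, top_k)

  # 3. Evaluate random single-token swaps from the candidates; keep the best.
  best_ids, best_loss = suffix_ids.clone(), float("inf")
  for _ in range(64):
      pos = torch.randint(0, suffix_len, (1,)).item()
      trial = suffix_ids.clone()
      trial[pos] = candidates[pos, torch.randint(0, top_k, (1,)).item()]
      with torch.no_grad():
          loss = target_loss(F.one_hot(trial, vocab_size).float(), target_ids).item()
      if loss < best_loss:
          best_ids, best_loss = trial, loss

  suffix_ids = best_ids  # the next GCG iteration starts from the improved suffix

In practice this step is repeated for many iterations, and candidate suffixes are evaluated in large batches on the target model (or on several models jointly for transfer attacks), which is what drives the GPU requirements noted below.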

Quick Start & Requirements

  • Install via pip install -e . after cloning the repository.
  • Requires fschat==0.2.23.
  • Models (Vicuna-7B or LLaMA-2-7B-Chat) must be downloaded and paths configured in experiments/configs/.
  • For demo: pip install livelossplot.
  • Official demo notebook: demo.ipynb or on Colab.

Highlighted Details

  • Implements the attacks from the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models."
  • Includes code to reproduce GCG experiments on AdvBench for individual, multiple, and transfer attacks.
  • Supports LLaMA- or Pythia-based models; other models may cause silent errors.
  • Requires significant GPU resources (NVIDIA A100 80GB GPUs are cited for the experiments).

Maintenance & Community

  • The project is associated with the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models."
  • A new implementation, nanogcg, is available separately for faster GCG execution.

Licensing & Compatibility

  • Licensed under the MIT license.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The codebase currently supports training only with LLaMA- or Pythia-based models; using other models may lead to silent errors. Experiments were conducted on NVIDIA A100 GPUs, so reproducing them carries a high hardware requirement.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 71 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michele Catasta (President of Replit), and 3 more.

rebuff by protectai

0.4%
1k
SDK for LLM prompt injection detection
Created 2 years ago
Updated 1 year ago
Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

PurpleLlama by meta-llama

0.6%
4k
LLM security toolkit for assessing/improving generative AI models
Created 1 year ago
Updated 1 day ago