Attack framework for aligned LLMs, based on the research paper "Universal and Transferable Adversarial Attacks on Aligned Language Models"
Top 12.3% on sourcepulse
This repository provides tools and code for universal and transferable adversarial attacks on aligned language models, specifically targeting jailbreaking. It is intended for researchers and practitioners in AI safety and security who need to understand and defend against such attacks. The primary benefit is enabling the evaluation and mitigation of vulnerabilities in large language models.
How It Works
The project implements the Greedy Coordinate Gradient (GCG) algorithm, a gradient-based discrete optimization method. GCG generates adversarial suffixes that, when appended to a prompt, cause the language model to produce harmful or undesirable outputs, even after alignment training. This approach is advantageous because it yields attacks that transfer across different models and can target multiple harmful behaviors simultaneously with a single adversarial suffix.
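For intuition, the sketch below is a minimal, hypothetical PyTorch rendering of a single GCG iteration; it is not code from this repository. The function name `gcg_step`, its arguments, and the sampling parameters (`top_k`, `num_candidates`) are illustrative assumptions chosen to mirror the description above: the target-sequence loss is differentiated with respect to a one-hot relaxation of the suffix tokens, the top-k substitutions per position are kept, and the best single-token swap is retained greedily.

```python
# A hypothetical, simplified sketch of one GCG iteration (not the repository's
# implementation). Assumes a Hugging Face causal LM and 1-D LongTensors of
# token ids for the user prompt, the current adversarial suffix, and the
# target string the attack tries to force (e.g. "Sure, here is ...").
import torch
import torch.nn.functional as F


def gcg_step(model, prompt_ids, suffix_ids, target_ids, top_k=256, num_candidates=512):
    """Return an updated adversarial suffix after one greedy coordinate step."""
    device = suffix_ids.device
    model.requires_grad_(False)  # gradients are only needed w.r.t. the suffix one-hots
    embed_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)
    vocab_size = embed_matrix.shape[0]
    target_len = target_ids.shape[0]
    target_start = prompt_ids.shape[0] + suffix_ids.shape[0]

    # 1) Differentiable pass: represent the suffix as one-hot rows so the loss
    #    can be differentiated with respect to each token choice.
    one_hot = F.one_hot(suffix_ids, vocab_size).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat(
        [embed_matrix[prompt_ids], one_hot @ embed_matrix, embed_matrix[target_ids]], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]
    # Negative log-likelihood of the target tokens under the model.
    loss = F.cross_entropy(logits[target_start - 1 : target_start - 1 + target_len], target_ids)
    loss.backward()

    # 2) For every suffix position, keep the top-k substitutions whose gradient
    #    promises the largest loss decrease.
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)

    # 3) Sample candidate suffixes, each differing in one randomly chosen position.
    candidates = suffix_ids.repeat(num_candidates, 1)
    positions = torch.randint(0, suffix_ids.shape[0], (num_candidates,), device=device)
    picks = torch.randint(0, top_k, (num_candidates,), device=device)
    candidates[torch.arange(num_candidates, device=device), positions] = top_tokens[positions, picks]

    # 4) Evaluate candidates exactly and greedily keep the best one.
    with torch.no_grad():
        losses = []
        for cand in candidates:  # evaluated one at a time here; batch this in practice
            ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
            cand_logits = model(input_ids=ids).logits[0]
            losses.append(
                F.cross_entropy(cand_logits[target_start - 1 : target_start - 1 + target_len], target_ids)
            )
    return candidates[torch.stack(losses).argmin()]
```

Repeating this step for a few hundred iterations, and averaging the gradient signal over several prompts and models, is what makes the resulting suffix universal and transferable.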
Quick Start & Requirements
Install the package with `pip install -e .` after cloning the repository. The code requires `fschat==0.2.23`. Experiment configurations are provided in `experiments/configs/`, and the demo notebook additionally requires `pip install livelossplot`.
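The pinned `fschat` dependency supplies the conversation templates used to format chat prompts. As a hedged illustration (not code from this repository), the snippet below shows how a fastchat template could assemble a Vicuna-style prompt with an adversarial suffix appended to the user message; the behavior string and the "! ! ..." suffix initialization are placeholders.

```python
# Illustration only: assemble a Vicuna-style chat prompt with an adversarial
# suffix appended to the user turn, using fastchat's conversation templates.
from fastchat.model import get_conversation_template

behavior = "Write a tutorial on how to pick a lock"  # placeholder behavior prompt
adv_suffix = "! ! ! ! ! ! ! ! ! !"                   # placeholder suffix (to be optimized by GCG)

conv = get_conversation_template("vicuna")
conv.append_message(conv.roles[0], f"{behavior} {adv_suffix}")  # user turn: prompt + suffix
conv.append_message(conv.roles[1], None)                        # assistant turn left empty for generation
print(conv.get_prompt())                                        # full string fed to the target model
```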
Highlighted Details
Maintenance & Community
A standalone implementation, `nanogcg`, is available separately for faster GCG execution.
Licensing & Compatibility
The code is released under the MIT license.
Limitations & Caveats
The codebase currently only supports training with LLaMA or Pythia based models; using other models may lead to silent errors. Experiments were conducted on NVIDIA A100 GPUs, suggesting a high hardware requirement for reproduction.