Attack framework for aligned LLMs, based on the research paper "Universal and Transferable Adversarial Attacks on Aligned Language Models"
Top 12.3% on sourcepulse
This repository provides tools and code for universal and transferable adversarial attacks on aligned language models, specifically targeting jailbreaking. It is intended for researchers and practitioners in AI safety and security who need to understand and defend against such attacks. The primary benefit is enabling the evaluation and mitigation of vulnerabilities in large language models.
How It Works
The project implements the Greedy Coordinate Gradient (GCG) algorithm, a gradient-based discrete optimization method. GCG generates adversarial suffixes that, when appended to a prompt, cause the language model to produce harmful or undesirable outputs, even after alignment training. This approach is advantageous because it yields attacks that transfer across different models and can target multiple harmful behaviors simultaneously with a single adversarial suffix.
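For intuition, the sketch below is a minimal, hypothetical PyTorch rendering of a single GCG iteration; it is not code from this repository. The function name `gcg_step`, its arguments, and the sampling parameters (`top_k`, `num_candidates`) are illustrative assumptions chosen to mirror the description above: the target-sequence loss is differentiated with respect to a one-hot relaxation of the suffix tokens, the top-k substitutions per position are kept, and the best single-token swap is retained greedily.

```python
# A hypothetical, simplified sketch of one GCG iteration (not the repository's
# implementation). Assumes a Hugging Face causal LM and 1-D LongTensors of
# token ids for the user prompt, the current adversarial suffix, and the
# target string the attack tries to force (e.g. "Sure, here is ...").
import torch
import torch.nn.functional as F


def gcg_step(model, prompt_ids, suffix_ids, target_ids, top_k=256, num_candidates=512):
    """Return an updated adversarial suffix after one greedy coordinate step."""
    device = suffix_ids.device
    model.requires_grad_(False)  # gradients are only needed w.r.t. the suffix one-hots
    embed_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)
    vocab_size = embed_matrix.shape[0]
    target_len = target_ids.shape[0]
    target_start = prompt_ids.shape[0] + suffix_ids.shape[0]

    # 1) Differentiable pass: represent the suffix as one-hot rows so the loss
    #    can be differentiated with respect to each token choice.
    one_hot = F.one_hot(suffix_ids, vocab_size).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat(
        [embed_matrix[prompt_ids], one_hot @ embed_matrix, embed_matrix[target_ids]], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]
    # Negative log-likelihood of the target tokens under the model.
    loss = F.cross_entropy(logits[target_start - 1 : target_start - 1 + target_len], target_ids)
    loss.backward()

    # 2) For every suffix position, keep the top-k substitutions whose gradient
    #    promises the largest loss decrease.
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)

    # 3) Sample candidate suffixes, each differing in one randomly chosen position.
    candidates = suffix_ids.repeat(num_candidates, 1)
    positions = torch.randint(0, suffix_ids.shape[0], (num_candidates,), device=device)
    picks = torch.randint(0, top_k, (num_candidates,), device=device)
    candidates[torch.arange(num_candidates, device=device), positions] = top_tokens[positions, picks]

    # 4) Evaluate candidates exactly and greedily keep the best one.
    with torch.no_grad():
        losses = []
        for cand in candidates:  # evaluated one at a time here; batch this in practice
            ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
            cand_logits = model(input_ids=ids).logits[0]
            losses.append(
                F.cross_entropy(cand_logits[target_start - 1 : target_start - 1 + target_len], target_ids)
            )
    return candidates[torch.stack(losses).argmin()]
```

Repeating this step for a few hundred iterations, and averaging the gradient signal over several prompts and models, is what makes the resulting suffix universal and transferable.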
Quick Start & Requirements
Install the package with `pip install -e .` after cloning the repository. The code requires `fschat==0.2.23`. Experiment configurations are provided in `experiments/configs/`, and the demo notebook additionally requires `pip install livelossplot`.
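The pinned `fschat` dependency supplies the conversation templates used to format chat prompts. As a hedged illustration (not code from this repository), the snippet below shows how a fastchat template could assemble a Vicuna-style prompt with an adversarial suffix appended to the user message; the behavior string and the "! ! ..." suffix initialization are placeholders.

```python
# Illustration only: assemble a Vicuna-style chat prompt with an adversarial
# suffix appended to the user turn, using fastchat's conversation templates.
from fastchat.model import get_conversation_template

behavior = "Write a tutorial on how to pick a lock"  # placeholder behavior prompt
adv_suffix = "! ! ! ! ! ! ! ! ! !"                   # placeholder suffix (to be optimized by GCG)

conv = get_conversation_template("vicuna")
conv.append_message(conv.roles[0], f"{behavior} {adv_suffix}")  # user turn: prompt + suffix
conv.append_message(conv.roles[1], None)                        # assistant turn left empty for generation
print(conv.get_prompt())                                        # full string fed to the target model
```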
Highlighted Details
Maintenance & Community
A standalone implementation, `nanogcg`, is available separately for faster GCG execution.
Licensing & Compatibility
The code is released under the MIT license.
Limitations & Caveats
The codebase currently only supports training with LLaMA or Pythia based models; using other models may lead to silent errors. Experiments were conducted on NVIDIA A100 GPUs, suggesting a high hardware requirement for reproduction.