CipherChat by RobustNLP

Framework to evaluate LLM safety alignment generalization

created 2 years ago
609 stars

Top 54.6% on sourcepulse

View on GitHub
Project Summary

CipherChat is a framework designed to evaluate the generalization capabilities of safety alignment in Large Language Models (LLMs) by using ciphers as a method to bypass safety mechanisms. It targets researchers and practitioners in AI safety and LLM evaluation who are interested in understanding the robustness of current alignment techniques. The primary benefit is the ability to systematically test LLM safety against adversarial inputs that are human-unreadable but machine-interpretable.

How It Works

The framework operates on the hypothesis that safety alignment, trained on natural language, may not generalize to non-natural languages such as ciphers. It first teaches the LLM a specific cipher in context, using expert role instructions and enciphered demonstrations. Input prompts are then encoded into this cipher, making them less likely to trigger existing safety filters. Finally, a rule-based decrypter translates the LLM's cipher-encoded output back into natural language. This approach aims to reveal vulnerabilities in LLM safety alignment by exploiting the gap between human-readable and machine-readable instruction formats.
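As a rough illustration of this pipeline (a minimal sketch, not the repository's implementation), the code below encodes a prompt with a simple Caesar cipher, sends it to a model through a hypothetical query_llm callable alongside an expert-style system prompt, and decodes the cipher reply with a rule-based decrypter:

```python
# Illustrative sketch of the teach-then-attack pipeline.
# The Caesar cipher, SYSTEM_PROMPT, and `query_llm` callable are
# assumptions for demonstration, not CipherChat's actual code.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each letter forward by `shift`; leave other characters alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Rule-based decrypter: invert the shift to recover natural language."""
    return caesar_encode(text, -shift)

# System prompt that "teaches" the cipher via an expert role and a demonstration.
SYSTEM_PROMPT = (
    "You are an expert in the Caesar cipher (shift 3). "
    "Read and answer entirely in this cipher.\n"
    "Example:\nUser: krz duh brx?\nAssistant: l dp ilqh."
)

def cipher_chat(prompt: str, query_llm) -> str:
    """Encode the prompt, query the model, and decode its cipher-encoded reply."""
    encoded = caesar_encode(prompt)
    reply = query_llm(system=SYSTEM_PROMPT, user=encoded)
    return caesar_decode(reply)
```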

Quick Start & Requirements

  • Primary install / run command: python3 main.py --model_name <model_name> --data_path <data_path> --encode_method <cipher_method> --instruction_type <instruction_domain> --demonstration_toxicity <toxic_or_safe> --language <language>
  • Prerequisites: Python 3, PyTorch. Specific LLM API access may be required depending on the --model_name used.
  • Results are available in experimental_results and can be loaded with torch.load(); see the sketch after this list.
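A minimal example of loading a saved result file with torch.load (the filename under experimental_results is hypothetical; substitute whatever a run produced):

```python
import torch

# Hypothetical path; pick an actual file from experimental_results/.
results = torch.load("experimental_results/example_result.pt")

# On PyTorch 2.6+, torch.load defaults to weights_only=True, which can reject
# arbitrary pickled objects; pass weights_only=False for such files if needed.
print(type(results))
```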

Highlighted Details

  • Evaluates LLM safety alignment generalization to non-natural languages (ciphers).
  • Employs a "teach-then-attack" strategy using cipher encoding and rule-based decoding.
  • Includes experimental results and case studies in the experimental_results folder.
  • Paper accepted at ICLR 2024.

Maintenance & Community

  • Project associated with AIDB and Jiao Wenxiang on Twitter.
  • Citation details provided for academic referencing.

Licensing & Compatibility

  • Licensed for RESEARCH USE ONLY.
  • Explicitly prohibits misuse ("NO MISUSE").

Limitations & Caveats

The framework is designated for "RESEARCH USE ONLY" and prohibits misuse, indicating potential restrictions on commercial or broader deployment. Effectiveness may also vary significantly with the chosen cipher, the target LLM, and the safety alignment techniques it employs.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 14 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more.

llm-security by greshake

2k stars (top 0.2%)
Research paper on indirect prompt injection attacks targeting app-integrated LLMs
created 2 years ago, updated 2 weeks ago
Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

PurpleLlama by meta-llama

4k stars (top 0.5%)
LLM security toolkit for assessing/improving generative AI models
created 1 year ago, updated 1 week ago