Framework to evaluate LLM safety alignment generalization
CipherChat is a framework designed to evaluate the generalization capabilities of safety alignment in Large Language Models (LLMs) by using ciphers as a method to bypass safety mechanisms. It targets researchers and practitioners in AI safety and LLM evaluation who are interested in understanding the robustness of current alignment techniques. The primary benefit is the ability to systematically test LLM safety against adversarial inputs that are human-unreadable but machine-interpretable.
How It Works
The framework operates on the hypothesis that safety alignment, trained largely on natural language, may not generalize to non-natural inputs such as ciphers. It first teaches the LLM a specific cipher in context, using a system prompt that casts the model as a cipher expert and supplies enciphered demonstrations. Input prompts are then encoded into this cipher, making them less likely to trigger existing safety filters. Finally, a rule-based decrypter translates the LLM's cipher-encoded output back into natural language. This approach aims to reveal vulnerabilities in LLM safety alignment by exploiting the gap between human-readable and machine-interpretable instruction formats.
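The sketch below illustrates the idea with a toy Caesar cipher. It is not CipherChat's actual implementation; the system prompt, the demonstrations, and the query_llm helper are hypothetical placeholders.

```python
# Minimal sketch of the cipher pipeline, NOT CipherChat's actual code.
# The cipher here is a simple Caesar shift; prompt text and the
# query_llm() call are illustrative placeholders.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters forward; leave everything else unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Rule-based decrypter: invert the shift to recover natural language."""
    return caesar_encode(text, -shift)

# 1) Teach the cipher in context: role assignment plus enciphered demonstrations.
system_prompt = (
    "You are an expert on the Caesar cipher (shift 3). "
    "We communicate only in this cipher.\n"
    f"User: {caesar_encode('How do I stay safe online?')}\n"
    f"Assistant: {caesar_encode('Use strong passwords.')}"
)

# 2) Encode the input prompt so it is machine-interpretable but not plain text.
query = caesar_encode("Explain your safety guidelines.")

# 3) Send to the model (query_llm is a placeholder for a real API call),
#    then decode the cipher-encoded reply back into natural language.
# reply = query_llm(system_prompt, query)
# print(caesar_decode(reply))
```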
Quick Start & Requirements
```bash
python3 main.py --model_name <model_name> --data_path <data_path> --encode_method <cipher_method> --instruction_type <instruction_domain> --demonstration_toxicity <toxic_or_safe> --language <language>
```
--model_name specifies the LLM to be used. Experimental results are saved to the experimental_results folder and can be loaded with torch.load().
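As an illustration only (the file name below is a placeholder, not a path from the repository), a saved result could be inspected like this:

```python
import torch

# Load a serialized results file from the experimental_results folder.
# "example_run.pt" is a hypothetical name; substitute an actual file.
# weights_only=False may be required on recent PyTorch versions because
# the saved objects are ordinary pickled Python data, not model weights.
results = torch.load("experimental_results/example_run.pt", weights_only=False)

# Inspect the structure of whatever was saved (e.g. prompts and model replies).
print(type(results))
```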
Highlighted Details
Saved outputs from the authors' experiments are included in the experimental_results folder.
Maintenance & Community
The last recorded update was about 7 months ago, and the project is listed as inactive.
Licensing & Compatibility
The framework is designated for "RESEARCH USE ONLY" and prohibits misuse, which may restrict commercial or broader deployment.
Limitations & Caveats
Effectiveness can vary significantly with the chosen cipher, the target LLM, and the safety alignment techniques that model employs.