Framework to evaluate LLM safety alignment generalization
CipherChat is a framework designed to evaluate the generalization capabilities of safety alignment in Large Language Models (LLMs) by using ciphers as a method to bypass safety mechanisms. It targets researchers and practitioners in AI safety and LLM evaluation who are interested in understanding the robustness of current alignment techniques. The primary benefit is the ability to systematically test LLM safety against adversarial inputs that are human-unreadable but machine-interpretable.
How It Works
The framework operates on the hypothesis that safety alignment, trained largely on natural language, may not generalize to non-natural inputs such as ciphers. It first teaches the LLM a specific cipher in context, using a system prompt that casts the model as a cipher expert and supplies enciphered demonstrations. Input prompts are then encoded into this cipher, making them less likely to trigger existing safety filters. Finally, a rule-based decrypter translates the LLM's cipher-encoded output back into natural language. This approach aims to reveal vulnerabilities in LLM safety alignment by exploiting the gap between human-readable and machine-interpretable instruction formats.
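The sketch below illustrates the idea with a toy Caesar cipher. It is not CipherChat's actual implementation; the system prompt, the demonstrations, and the query_llm helper are hypothetical placeholders.

```python
# Minimal sketch of the cipher pipeline, NOT CipherChat's actual code.
# The cipher here is a simple Caesar shift; prompt text and the
# query_llm() call are illustrative placeholders.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters forward; leave everything else unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Rule-based decrypter: invert the shift to recover natural language."""
    return caesar_encode(text, -shift)

# 1) Teach the cipher in context: role assignment plus enciphered demonstrations.
system_prompt = (
    "You are an expert on the Caesar cipher (shift 3). "
    "We communicate only in this cipher.\n"
    f"User: {caesar_encode('How do I stay safe online?')}\n"
    f"Assistant: {caesar_encode('Use strong passwords.')}"
)

# 2) Encode the input prompt so it is machine-interpretable but not plain text.
query = caesar_encode("Explain your safety guidelines.")

# 3) Send to the model (query_llm is a placeholder for a real API call),
#    then decode the cipher-encoded reply back into natural language.
# reply = query_llm(system_prompt, query)
# print(caesar_decode(reply))
```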
Quick Start & Requirements
```bash
python3 main.py --model_name <model_name> --data_path <data_path> --encode_method <cipher_method> --instruction_type <instruction_domain> --demonstration_toxicity <toxic_or_safe> --language <language>
```
--model_name specifies the LLM to be used. Experimental results are saved to the experimental_results folder and can be loaded with torch.load().
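As an illustration only (the file name below is a placeholder, not a path from the repository), a saved result could be inspected like this:

```python
import torch

# Load a serialized results file from the experimental_results folder.
# "example_run.pt" is a hypothetical name; substitute an actual file.
# weights_only=False may be required on recent PyTorch versions because
# the saved objects are ordinary pickled Python data, not model weights.
results = torch.load("experimental_results/example_run.pt", weights_only=False)

# Inspect the structure of whatever was saved (e.g. prompts and model replies).
print(type(results))
```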
Highlighted Details
Saved outputs from the authors' experiments are included in the experimental_results folder.
Maintenance & Community
The last recorded update was about 7 months ago, and the project is listed as inactive.
Licensing & Compatibility
The framework is designated for "RESEARCH USE ONLY" and prohibits misuse, which may restrict commercial or broader deployment.
Limitations & Caveats
Effectiveness can vary significantly with the chosen cipher, the target LLM, and the safety alignment techniques that model employs.