LLM red-teaming research using agentic deception
Top 99.3% on sourcepulse
This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.
How It Works
The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.
2 years ago
Inactive