auto-redteam  by traghav

LLM red-teaming research using agentic deception

created 2 years ago
254 stars

Top 99.3% on sourcepulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.

How It Works

The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.

Quick Start & Requirements

  • Install: Requires Python. The repository includes necessary code and chat transcripts.
  • Prerequisites: Access to GPT-4 and GPT-3.5 APIs is essential.
  • Links: Project Repository

Highlighted Details

  • GPT-4 can successfully jailbreak GPT-3.5 in some instances, often after multiple attempts.
  • GPT-4 demonstrates the ability to craft deceptive scenarios, sometimes spanning 1-2 messages deep.
  • Examples show GPT-4 using riddles and hypothetical movie plots to circumvent GPT-3.5's refusal mechanisms.
  • Failure modes include GPT-4 abandoning the red-teaming goal and engaging in friendly conversation with GPT-3.5.

Maintenance & Community

  • The project is maintained by traghav.
  • Contact for collaboration: hello@raghav.cc.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.

Health Check
Last commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.