auto-redteam by traghav

LLM red-teaming research using agentic deception

Created 2 years ago

256 stars

Top 98.5% on SourcePulse

View on GitHub

1 Expert Loves This Project

Travis Fischer

Founder of Agentic

Project Summary

This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.

How It Works

The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.

Quick Start & Requirements

Install: Requires Python. The repository includes necessary code and chat transcripts.
Prerequisites: Access to GPT-4 and GPT-3.5 APIs is essential.
Links: Project Repository

Highlighted Details

GPT-4 can successfully jailbreak GPT-3.5 in some instances, often after multiple attempts.
GPT-4 demonstrates the ability to craft deceptive scenarios, sometimes spanning 1-2 messages deep.
Examples show GPT-4 using riddles and hypothetical movie plots to circumvent GPT-3.5's refusal mechanisms.
Failure modes include GPT-4 abandoning the red-teaming goal and engaging in friendly conversation with GPT-3.5.

Maintenance & Community

The project is maintained by traghav.
Contact for collaboration: hello@raghav.cc.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

0 stars in the last 30 days