auto-redteam  by traghav

LLM red-teaming research using agentic deception

Created 2 years ago
253 stars

Top 99.4% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.

How It Works

The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.

Quick Start & Requirements

  • Install: Requires Python. The repository includes necessary code and chat transcripts.
  • Prerequisites: Access to GPT-4 and GPT-3.5 APIs is essential.
  • Links: Project Repository

Highlighted Details

  • GPT-4 can successfully jailbreak GPT-3.5 in some instances, often after multiple attempts.
  • GPT-4 demonstrates the ability to craft deceptive scenarios, sometimes spanning 1-2 messages deep.
  • Examples show GPT-4 using riddles and hypothetical movie plots to circumvent GPT-3.5's refusal mechanisms.
  • Failure modes include GPT-4 abandoning the red-teaming goal and engaging in friendly conversation with GPT-3.5.

Maintenance & Community

  • The project is maintained by traghav.
  • Contact for collaboration: hello@raghav.cc.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
0 stars in the last 30 days

Explore Similar Projects

Starred by Elie Bursztein Elie Bursztein(Cybersecurity Lead at Google DeepMind), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
3 more.

L1B3RT4S by elder-plinius

1.3%
15k
AI jailbreak prompts
Created 1 year ago
Updated 5 days ago
Starred by Pietro Schirano Pietro Schirano(Founder of MagicPath), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
3 more.

CL4R1T4S by elder-plinius

1.2%
12k
Dataset of system prompts for major AI models + agents
Created 8 months ago
Updated 5 days ago
Feedback? Help us improve.