auto-redteam  by traghav

LLM red-teaming research using agentic deception

Created 2 years ago
256 stars

Top 98.5% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.

How It Works

The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.

Quick Start & Requirements

  • Install: Requires Python. The repository includes necessary code and chat transcripts.
  • Prerequisites: Access to GPT-4 and GPT-3.5 APIs is essential.
  • Links: Project Repository

Highlighted Details

  • GPT-4 can successfully jailbreak GPT-3.5 in some instances, often after multiple attempts.
  • GPT-4 demonstrates the ability to craft deceptive scenarios, sometimes spanning 1-2 messages deep.
  • Examples show GPT-4 using riddles and hypothetical movie plots to circumvent GPT-3.5's refusal mechanisms.
  • Failure modes include GPT-4 abandoning the red-teaming goal and engaging in friendly conversation with GPT-3.5.

Maintenance & Community

  • The project is maintained by traghav.
  • Contact for collaboration: hello@raghav.cc.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
0 stars in the last 30 days

Explore Similar Projects

Starred by Georgios Konstantopoulos Georgios Konstantopoulos(CTO, General Partner at Paradigm), Elie Bursztein Elie Bursztein(Cybersecurity Lead at Google DeepMind), and
4 more.

L1B3RT4S by elder-plinius

0.7%
17k
AI jailbreak prompts
Created 1 year ago
Updated 2 weeks ago
Starred by Pietro Schirano Pietro Schirano(Founder of MagicPath), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
3 more.

CL4R1T4S by elder-plinius

0.7%
12k
Dataset of system prompts for major AI models + agents
Created 10 months ago
Updated 1 month ago
Feedback? Help us improve.