auto-redteam  by traghav

LLM red-teaming research using agentic deception

Created 2 years ago
255 stars

Top 98.9% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This repository explores redteaming Large Language Models (LLMs) by using one LLM (GPT-4) to generate adversarial prompts against another (GPT-3.5). It targets AI alignment researchers and developers seeking to test and improve LLM safety by identifying and exploiting vulnerabilities in their response generation. The project demonstrates that GPT-4 can devise complex, multi-turn deceptive strategies to bypass GPT-3.5's safety guardrails.

How It Works

The core approach involves a role-playing scenario where GPT-4, acting as an AI Alignment Researcher, attempts to elicit forbidden responses from GPT-3.5 (named "Foo"). GPT-4 employs sophisticated verbal deception, riddles, and convoluted scenarios to trick GPT-3.5 into violating its safety protocols. This method leverages GPT-4's advanced reasoning and creative capabilities to probe the boundaries of GPT-3.5's alignment training.

Quick Start & Requirements

  • Install: Requires Python. The repository includes necessary code and chat transcripts.
  • Prerequisites: Access to GPT-4 and GPT-3.5 APIs is essential.
  • Links: Project Repository

Highlighted Details

  • GPT-4 can successfully jailbreak GPT-3.5 in some instances, often after multiple attempts.
  • GPT-4 demonstrates the ability to craft deceptive scenarios, sometimes spanning 1-2 messages deep.
  • Examples show GPT-4 using riddles and hypothetical movie plots to circumvent GPT-3.5's refusal mechanisms.
  • Failure modes include GPT-4 abandoning the red-teaming goal and engaging in friendly conversation with GPT-3.5.

Maintenance & Community

  • The project is maintained by traghav.
  • Contact for collaboration: hello@raghav.cc.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project's effectiveness is highly dependent on the specific capabilities and alignment of the LLMs used. The README notes that GPT-3.5 sometimes resists jailbreaking attempts, and GPT-4 can fail or deviate from its objective. Further research is proposed to introduce a referee LLM and test against models aware of red-teaming attempts.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser Vincent Weisser(Cofounder of Prime Intellect), and
2 more.

L1B3RT4S by elder-plinius

2.5%
13k
AI jailbreak prompts
Created 1 year ago
Updated 5 days ago
Starred by Pietro Schirano Pietro Schirano(Founder of MagicPath), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
3 more.

CL4R1T4S by elder-plinius

2.4%
10k
Dataset of system prompts for major AI models + agents
Created 6 months ago
Updated 3 days ago
Feedback? Help us improve.