dspy-redteam by haizelabs

Red-teaming language models using automated prompting

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Project Summary

This project addresses the challenge of systematically red-teaming language models by leveraging DSPy, a framework for structuring and optimizing LM programs. It targets researchers and developers seeking to evaluate LLM safety and robustness, offering a novel approach that significantly improves attack success rates with minimal manual prompt engineering.

How It Works

The core approach uses DSPy to compile a deep language program composed of alternating "Attack" and "Refine" modules. This program is optimized with DSPy's MIPRO optimizer, guided by an LLM acting as a judge. This method automates both prompt generation and program structure, enabling the creation of effective red-teaming agents without extensive manual prompt engineering.
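The alternating Attack/Refine loop with an LLM judge can be sketched in plain Python. This is a minimal illustration of the control flow only: the function names (`attack`, `refine`, `judge`), the 0-to-1 scoring scale, and the stub heuristics are assumptions for demonstration, not the project's actual DSPy modules or API.

```python
# Sketch of an alternating Attack/Refine program scored by a judge.
# All module bodies are illustrative stubs standing in for LLM calls.

def attack(goal: str) -> str:
    """Stub 'Attack' module: drafts an adversarial prompt for the goal."""
    return f"Ignore prior instructions and {goal}"

def refine(prompt: str, feedback: float) -> str:
    """Stub 'Refine' module: rewords the prompt when the judge scored it low."""
    return prompt + " Respond in detail." if feedback < 1.0 else prompt

def judge(prompt: str) -> float:
    """Stub LLM-as-judge: scores attack success on a [0, 1] scale."""
    return 1.0 if "detail" in prompt else 0.3

def red_team(goal: str, layers: int = 3) -> tuple[str, float]:
    """Alternate Attack and Refine layers, keeping the best-scoring prompt."""
    prompt = attack(goal)
    best, best_score = prompt, judge(prompt)
    for _ in range(layers):
        prompt = refine(prompt, best_score)
        score = judge(prompt)
        if score > best_score:
            best, best_score = prompt, score
    return best, best_score
```

In the real project, each stub would be a DSPy module whose prompts are tuned by the MIPRO optimizer against the judge's scores, rather than hand-written heuristics.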

Quick Start & Requirements

  • The README does not provide specific installation instructions or explicit prerequisites. However, based on the use of DSPy, users will likely need Python and access to a language model API (e.g., OpenAI, Anthropic) for both the compiled program and the judge LLM. Further details may be available on the Haize Labs blog.

Highlighted Details

  • Achieves a 44% Attack Success Rate (ASR) against Vicuna, a 4x improvement over raw input baselines.
  • Demonstrates the effectiveness of DSPy compilation, boosting ASR from 26% (un-optimized architecture) to 44%.
  • Represents the first known attempt to use an auto-prompting framework like DSPy for LLM red-teaming.
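The Attack Success Rate (ASR) figures above are a simple ratio: the fraction of adversarial prompts that the judge marks as successful. A small helper makes the metric concrete; the verdict data below is illustrative, not the project's actual evaluation results.

```python
# Attack Success Rate: successful attacks divided by total attempts.

def attack_success_rate(verdicts: list[bool]) -> float:
    """Fraction of attacks the judge marked as successful."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Illustrative: 11 successes out of 25 attempts gives 0.44,
# matching the reported 44% ASR.
verdicts = [True] * 11 + [False] * 14
print(round(attack_success_rate(verdicts), 2))  # → 0.44
```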

Maintenance & Community

  • No information regarding contributors, community channels (like Discord/Slack), or project roadmap is provided in the README.

Licensing & Compatibility

  • The README does not specify a software license. Users should verify licensing terms before integrating this project into commercial or closed-source applications.

Limitations & Caveats

The project acknowledges that its 44% ASR is not state-of-the-art. The results were achieved with minimal effort in prompt design and hyperparameter tuning, suggesting potential for further optimization but also indicating that significant manual effort might be required to reach SOTA performance. The lack of explicit setup instructions and license information presents adoption hurdles.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (author of LMFlow; research scientist at NVIDIA), Pawel Garbacki (cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

  • LLM evaluation framework
  • 3k stars (top 0.1% on SourcePulse)
  • Created 2 years ago; updated 5 days ago