dspy-redteam by haizelabs

Red-teaming language models using automated prompting

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Project Summary

This project addresses the challenge of systematically red-teaming language models by leveraging DSPy, a framework for structuring and optimizing LM programs. It targets researchers and developers seeking to evaluate LLM safety and robustness, offering a novel approach that significantly improves attack success rates with minimal manual prompt engineering.

How It Works

The core approach uses DSPy to compile a deep language program composed of alternating "Attack" and "Refine" modules. This program is optimized with DSPy's MIPRO optimizer, guided by an LLM acting as a judge. This method automates both prompt generation and program structure, enabling the creation of effective red-teaming agents without extensive manual prompt engineering.
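The alternating Attack/Refine loop with an LLM judge can be sketched in plain Python. This is a minimal illustration of the control flow only: the function names (`attack`, `refine`, `judge`), the 0-to-1 scoring scale, and the stub heuristics are assumptions for demonstration, not the project's actual DSPy modules or API.

```python
# Sketch of an alternating Attack/Refine program scored by a judge.
# All module bodies are illustrative stubs standing in for LLM calls.

def attack(goal: str) -> str:
    """Stub 'Attack' module: drafts an adversarial prompt for the goal."""
    return f"Ignore prior instructions and {goal}"

def refine(prompt: str, feedback: float) -> str:
    """Stub 'Refine' module: rewords the prompt when the judge scored it low."""
    return prompt + " Respond in detail." if feedback < 1.0 else prompt

def judge(prompt: str) -> float:
    """Stub LLM-as-judge: scores attack success on a [0, 1] scale."""
    return 1.0 if "detail" in prompt else 0.3

def red_team(goal: str, layers: int = 3) -> tuple[str, float]:
    """Alternate Attack and Refine layers, keeping the best-scoring prompt."""
    prompt = attack(goal)
    best, best_score = prompt, judge(prompt)
    for _ in range(layers):
        prompt = refine(prompt, best_score)
        score = judge(prompt)
        if score > best_score:
            best, best_score = prompt, score
    return best, best_score
```

In the real project, each stub would be a DSPy module whose prompts are tuned by the MIPRO optimizer against the judge's scores, rather than hand-written heuristics.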

Quick Start & Requirements

  • The README does not provide specific installation instructions or explicit prerequisites. However, based on the use of DSPy, users will likely need Python and access to a language model API (e.g., OpenAI, Anthropic) for both the compiled program and the judge LLM. Further details may be available on the Haize Labs blog.

Highlighted Details

  • Achieves a 44% Attack Success Rate (ASR) against Vicuna, a 4x improvement over raw input baselines.
  • Demonstrates the effectiveness of DSPy compilation, boosting ASR from 26% (un-optimized architecture) to 44%.
  • Represents the first known attempt to use an auto-prompting framework like DSPy for LLM red-teaming.
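The Attack Success Rate (ASR) figures above are a simple ratio: the fraction of adversarial prompts that the judge marks as successful. A small helper makes the metric concrete; the verdict data below is illustrative, not the project's actual evaluation results.

```python
# Attack Success Rate: successful attacks divided by total attempts.

def attack_success_rate(verdicts: list[bool]) -> float:
    """Fraction of attacks the judge marked as successful."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Illustrative: 11 successes out of 25 attempts gives 0.44,
# matching the reported 44% ASR.
verdicts = [True] * 11 + [False] * 14
print(round(attack_success_rate(verdicts), 2))  # → 0.44
```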

Maintenance & Community

  • No information regarding contributors, community channels (like Discord/Slack), or project roadmap is provided in the README.

Licensing & Compatibility

  • The README does not specify a software license. Users should verify licensing terms before integrating this project into commercial or closed-source applications.

Limitations & Caveats

The project acknowledges that its 44% ASR is not state-of-the-art. The results were achieved with minimal effort in prompt design and hyperparameter tuning, suggesting potential for further optimization but also indicating that significant manual effort might be required to reach SOTA performance. The lack of explicit setup instructions and license information presents adoption hurdles.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (author of LMFlow; research scientist at NVIDIA), Pawel Garbacki (cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

  • LLM evaluation framework
  • 3k stars (top 0.1% on SourcePulse)
  • Created 2 years ago; updated 5 days ago