refusal_direction  by andyrdt

Research paper code for analyzing refusal in language models

created 1 year ago
254 stars

Top 99.1% on SourcePulse

Project Summary

This repository provides the code and results for the paper "Refusal in Language Models Is Mediated by a Single Direction." It lets researchers and practitioners reproduce the paper's findings on how refusal behavior in LLMs can be controlled through a single direction in activation space, which can be used to either suppress or induce refusal.

How It Works

The project implements a pipeline that identifies a "refusal direction" in a model's activations and then applies it. The direction is found by comparing the model's activations on harmful prompts with its activations on harmless prompts. Removing the component of the activations along this direction makes the model less likely to refuse, while adding to it makes refusal more likely. Compared with full fine-tuning, this is a more targeted and efficient way to control the model's behavior.
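The idea above can be illustrated with a minimal sketch. This is not the repository's code: it uses NumPy on synthetic activations rather than a real model, takes the difference of mean activations as one simple way to compare the two prompt sets, and the names `refusal_direction`, `ablate`, `add_direction`, `d_model`, and `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    return acts - np.outer(acts @ direction, direction)


def add_direction(acts, direction, alpha):
    """Shift every activation by `alpha` along `direction`."""
    return acts + alpha * direction


# Synthetic stand-in for model activations (assumption: no real model here).
# "Harmful" activations are offset along one hidden axis of a 16-dim space.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)

# Ablation: projections onto r_hat become (numerically) zero.
ablated = ablate(harmful, r_hat)

# Addition: harmless activations are pushed along r_hat.
steered = add_direction(harmless, r_hat, alpha=4.0)
```

In a real model, the same two operations would be applied to intermediate activations during the forward pass; which layers and token positions to use is a separate choice that this sketch does not address.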

Quick Start & Requirements

Health Check
Last commit: 2 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 1
Star History
12 stars in the last 30 days
