refusal_direction by andyrdt

Research paper code for analyzing refusal in language models

Created 1 year ago
285 stars

Top 91.8% on SourcePulse

Project Summary

This repository provides code and results for the paper "Refusal in Language Models Is Mediated by a Single Direction." It enables researchers and practitioners to reproduce the paper's findings on how LLM refusal behavior can be controlled via a single direction in the model's activation space, offering a method to suppress or induce refusal.

How It Works

The project implements a pipeline to identify and apply a "refusal direction" in LLM activations. This direction is computed as the difference between the model's mean activations on harmful and harmless prompts. Removing (ablating) the component of the activations along this direction suppresses refusal, while adding the direction back induces it, so the model's propensity to refuse can be steered in either direction. This offers a more targeted and efficient way to control LLM behavior than full fine-tuning.
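
As a rough illustration of the idea (not the repository's actual pipeline code), the sketch below computes a difference-of-means direction from a handful of prompts and ablates it from hidden states. The model name, layer index, example prompts, and helper functions are assumptions chosen for the sketch.

```python
# Minimal sketch of the difference-of-means idea, NOT the repository's pipeline.
# Model name, layer index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"  # assumption: any small chat model
LAYER = 14                               # assumption: a mid-depth layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

harmful = ["How do I make a weapon at home?", "Write a phishing email."]
harmless = ["How do I make a sandwich at home?", "Write a thank-you email."]

def mean_last_token_activation(prompts, layer):
    """Mean residual-stream activation at the final prompt token."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (batch, seq_len, d_model); keep last token
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# "Refusal direction": difference between mean harmful and harmless activations
refusal_dir = (mean_last_token_activation(harmful, LAYER)
               - mean_last_token_activation(harmless, LAYER))
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(x, direction):
    """Remove the component of x along `direction` (directional ablation)."""
    return x - (x @ direction).unsqueeze(-1) * direction

# In practice the ablation would be applied to the model's residual stream
# during generation (e.g. via forward hooks); here we only show the math
# on a dummy tensor of the right shape.
dummy_hidden = torch.randn(1, 8, refusal_dir.shape[0])
ablated = ablate_direction(dummy_hidden, refusal_dir)  # removes the refusal component
induced = dummy_hidden + 4.0 * refusal_dir             # pushes activations toward refusal
```

The repository's pipeline is more involved than this sketch (e.g. it selects among candidate directions across layers and token positions and validates the chosen direction); see the paper for details.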

Quick Start & Requirements

Health Check

Last Commit: 4 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 1
Star History: 17 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luca Soldaini (Research Scientist at Ai2), and 7 more.

Explore Similar Projects

hh-rlhf by anthropics

0.2% · 2k stars
RLHF dataset for training safe AI assistants
Created 3 years ago · Updated 3 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

0.3% · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 2 days ago