Research paper code for analyzing refusal in language models
This repository provides code and results for the paper "Refusal in Language Models Is Mediated by a Single Direction." It lets researchers and practitioners reproduce the paper's findings that refusal behavior in LLMs can be controlled through a single direction in activation space, which offers a way to potentially suppress or induce refusal.
How It Works
The project implements a pipeline that identifies a "refusal direction" in LLM activations and then applies it. The direction is estimated from the difference between the model's activations on harmful prompts and its activations on harmless prompts. Shifting activations along this direction raises or lowers the model's propensity to refuse. Compared with full fine-tuning, this is a targeted and inexpensive way to control that behavior.
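The idea can be illustrated with a minimal sketch. It is not the repository's code: it assumes the direction is taken as the normalized difference of mean activations, and it uses synthetic vectors in plain Python lists in place of real model activations. The function names (`refusal_direction`, `ablate`, `add_direction`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means estimate, used here for illustration.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def ablate(act, direction):
    """Remove the component of `act` along the unit vector `direction`."""
    proj = dot(act, direction)
    return [a - proj * d for a, d in zip(act, direction)]


def add_direction(act, direction, scale):
    """Shift `act` along `direction` by `scale`."""
    return [a + scale * d for a, d in zip(act, direction)]


# Synthetic stand-in for activations: "harmful" samples are shifted
# along axis 0, so the recovered direction should be close to that axis.
random.seed(0)
D_MODEL = 8


def sample(shift):
    v = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    v[0] += shift
    return v


harmless = [sample(0.0) for _ in range(200)]
harmful = [sample(4.0) for _ in range(200)]

r_hat = refusal_direction(harmful, harmless)
ablated = [ablate(a, r_hat) for a in harmful]               # suppress the direction
steered = [add_direction(a, r_hat, 4.0) for a in harmless]  # induce the direction
```

In this toy setup, ablation leaves each vector with no component along `r_hat`, while addition raises each vector's projection onto `r_hat` by exactly the chosen scale. In a real model the same two operations would be applied to activations at chosen layers and token positions, which this sketch does not attempt to model.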
Quick Start & Requirements