refusal_direction by andyrdt

Research paper code for analyzing refusal in language models

Created 1 year ago

304 stars

Top 88.0% on SourcePulse

2 Experts Love This Project

winglian

Founder of Axolotl AI

mlabonne

Head of Post-Training at Liquid AI

Project Summary

This repository provides code and results for the paper "Refusal in Language Models Is Mediated by a Single Direction." It lets researchers and practitioners reproduce the paper's findings on how an LLM's refusal behavior can be controlled through a single direction in activation space, which offers a way to potentially suppress or induce refusal.

How It Works

The project implements a pipeline that identifies a "refusal direction" in an LLM's activations and then applies it. The direction is derived from the difference between the model's activations on harmful prompts and its activations on harmless prompts. Shifting activations along this direction, or removing their component along it, changes the model's propensity to refuse. Because only one direction is modified, the approach is more targeted and less costly than full fine-tuning.
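The idea described above can be illustrated with a small, self-contained sketch. It is not the repository's code: the function names (`refusal_direction`, `ablate`, `steer`) are hypothetical, the activations are toy three-dimensional lists rather than real model activations, and the sketch assumes the direction is taken as the normalized difference between the mean harmful and mean harmless activation vectors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation).

    Assumption for this sketch: one activation vector per prompt, all taken
    from the same layer and token position.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to extract")
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def steer(activation, direction, alpha):
    """Shift `activation` by `alpha` units along the unit vector `direction`."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical values, chosen so the arithmetic is easy to follow).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, 1.0], direction)
steered = steer(ablated, direction, 4.0)
```

In this toy setup the two groups differ only in the first coordinate, so the extracted direction is the first axis; `ablate` zeroes that coordinate and `steer` adds to it. With a real model the same operations would be applied to activations captured from the network, which this sketch does not attempt to show.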

Health Check

Last Commit

5 months ago

Responsiveness

1 day

Pull Requests (30d)

0

Issues (30d)

0

Star History

10 stars in the last 30 days

Explore Similar Projects

LLM-for-misinformation-research by ICTMCG

Paper list for misinformation research using large language models

Created 1 year ago

Updated 11 months ago

Starred by

Lewis Tunstall (Research Engineer at Hugging Face) and Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA).

chatgpt-failures by giuven95

LLM failure archive for ChatGPT and similar models

Created 2 years ago

Updated 2 years ago

Starred by

Pawel Garbacki (Cofounder of Fireworks AI).

agent-ci by pegasi-ai

AI testing framework for LLM output validation

Created 2 years ago

Updated 3 weeks ago

Awesome-LLM-Uncertainty-Reliability-Robustness by jxzhangjhu

Curated list of LLM uncertainty, reliability, and robustness resources

Created 2 years ago

Updated 6 months ago

semantic_uncertainty by jlko

Code for reproducing semantic uncertainty research paper experiments

Created 1 year ago

Updated 1 year ago

fmeval by aws

Evaluate foundation models for various NLP tasks

Created 2 years ago

Updated 3 months ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

LiveBench by LiveBench

LLM benchmark for evaluating models on recently released data

Created 1 year ago

Updated 4 days ago

promptmap by utkusen

Prompt injection scanner for LLM apps

Created 2 years ago

Updated 1 month ago

Starred by

Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luca Soldaini (Research Scientist at Ai2), and 7 more.

hh-rlhf by anthropics

RLHF dataset for training safe AI assistants

Created 3 years ago

Updated 5 months ago

Starred by

Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

LLM evaluation framework

Created 2 years ago

Updated 1 month ago

Starred by

Pawel Garbacki (Cofounder of Fireworks AI), Simon Willison (Coauthor of Django), and 8 more.

inspect_ai by UKGovernmentBEIS

Framework for large language model evaluations

Created 2 years ago

Updated 1 day ago

Starred by

Lilian Weng (Cofounder of Thinking Machines Lab), Bojan Tunguz (AI Scientist; Formerly at NVIDIA), and 21 more.

Prompt-Engineering-Guide by dair-ai

Prompt engineering resource for language model (LLM) applications

Created 3 years ago

Updated 1 week ago
