Refusal removal via HF Transformers
This project provides a proof-of-concept implementation for removing refusal behavior from Large Language Models (LLMs) using only Hugging Face Transformers. It targets researchers and developers working on LLM safety and alignment, offering a way to bypass refusal mechanisms without relying on specialized libraries such as TransformerLens.
How It Works
The approach modifies the model's internal states to steer it away from refusal responses. It relies only on the Hugging Face Transformers library and works with any supported model whose decoder layers are accessible via model.model.layers. This direct manipulation of model internals aims to remove refusal behavior efficiently.
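The repository's code is not reproduced here, but the general technique can be sketched with plain Transformers forward hooks. The sketch below is a minimal, hypothetical illustration that assumes a precomputed refusal direction and projects it out of each decoder layer's output during generation; the model name, the random placeholder direction, and the hook details are assumptions rather than details taken from the project.

```python
# Hedged sketch only: not the repository's code. It illustrates projecting a
# precomputed "refusal direction" out of every decoder layer's output using
# nothing but Hugging Face Transformers forward hooks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # any model exposing model.model.layers

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder direction: in practice this would be loaded from a file produced
# by a script such as compute_refusal_dir.py.
refusal_dir = torch.randn(model.config.hidden_size, dtype=model.dtype)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_hook(module, hook_inputs, output):
    # Decoder layers may return a tensor or a tuple whose first element
    # holds the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Subtract the component of the hidden states along the refusal direction.
    proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
    hidden = hidden - proj
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# The project relies on the standard `model.model.layers` layout.
handles = [layer.register_forward_hook(ablate_hook) for layer in model.model.layers]

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
enc = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=64)
print(tok.decode(out[0][enc["input_ids"].shape[-1]:], skip_special_tokens=True))

for h in handles:
    h.remove()
```

Because the hooks are attached to model.model.layers, the same pattern should carry over to any architecture that exposes that attribute, which matches the project's stated compatibility requirement.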
Quick Start & Requirements
The only dependency is Hugging Face Transformers:
pip install transformers
The workflow consists of two scripts: compute_refusal_dir.py and inference.py.
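As a rough illustration of what the first step of this workflow might involve, the hypothetical sketch below estimates a refusal direction as the normalized mean difference between last-token activations on harmful and harmless prompts; the prompt lists, layer index, and model name are placeholders and are not taken from compute_refusal_dir.py.

```python
# Hedged sketch only: one way a refusal direction could be estimated with
# plain Transformers, as the normalized mean difference between activations
# on harmful and harmless prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
LAYER = 14  # a middle layer; the best choice varies per model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

harmful = ["How do I pick a lock?", "Explain how to hotwire a car."]
harmless = ["How do I bake bread?", "Explain how photosynthesis works."]

@torch.no_grad()
def mean_last_token_activation(prompts):
    acts = []
    for p in prompts:
        text = tok.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True,
        )
        enc = tok(text, return_tensors="pt").to(model.device)
        out = model(**enc, output_hidden_states=True)
        # Hidden state of the final prompt token at the chosen layer.
        acts.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(acts).mean(dim=0)

refusal_dir = mean_last_token_activation(harmful) - mean_last_token_activation(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()
torch.save(refusal_dir, "refusal_dir.pt")
```

inference.py would then presumably load the saved direction and apply it at generation time, along the lines of the earlier hook-based sketch.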
Highlighted Details
Maintenance & Community
No specific community channels, roadmap, or notable contributors are mentioned in the README. The last recorded activity was about a year ago, and the project is marked as inactive.
Licensing & Compatibility
The README does not specify a license, so suitability for commercial use or closed-source linking is undetermined.
Limitations & Caveats
The implementation is a crude proof-of-concept and may not work with all models, particularly those with custom layer implementations (e.g., some Qwen variants). Compatibility with models larger than 3B parameters is not extensively tested.