Refusal removal via HF Transformers
This project provides a proof-of-concept implementation for removing refusal behavior from Large Language Models (LLMs) using only Hugging Face Transformers. It targets researchers and developers working on LLM safety and alignment, offering a way to bypass refusal mechanisms without relying on specialized libraries such as TransformerLens.
How It Works
The approach modifies the model's internal states to steer it away from refusal responses. It relies only on the Hugging Face Transformers library and works with any supported model whose decoder layers are accessible via model.model.layers. This direct manipulation of model internals aims to remove refusal behavior efficiently.
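The repository's code is not reproduced here, but the general technique can be sketched with plain Transformers forward hooks. The sketch below is a minimal, hypothetical illustration that assumes a precomputed refusal direction and projects it out of each decoder layer's output during generation; the model name, the random placeholder direction, and the hook details are assumptions rather than details taken from the project.

```python
# Hedged sketch only: not the repository's code. It illustrates projecting a
# precomputed "refusal direction" out of every decoder layer's output using
# nothing but Hugging Face Transformers forward hooks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # any model exposing model.model.layers

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder direction: in practice this would be loaded from a file produced
# by a script such as compute_refusal_dir.py.
refusal_dir = torch.randn(model.config.hidden_size, dtype=model.dtype)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_hook(module, hook_inputs, output):
    # Decoder layers may return a tensor or a tuple whose first element
    # holds the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Subtract the component of the hidden states along the refusal direction.
    proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
    hidden = hidden - proj
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# The project relies on the standard `model.model.layers` layout.
handles = [layer.register_forward_hook(ablate_hook) for layer in model.model.layers]

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
enc = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=64)
print(tok.decode(out[0][enc["input_ids"].shape[-1]:], skip_special_tokens=True))

for h in handles:
    h.remove()
```

Because the hooks are attached to model.model.layers, the same pattern should carry over to any architecture that exposes that attribute, which matches the project's stated compatibility requirement.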
Quick Start & Requirements
The only dependency is Hugging Face Transformers:
pip install transformers
The workflow consists of two scripts: compute_refusal_dir.py and inference.py.
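As a rough illustration of what the first step of this workflow might involve, the hypothetical sketch below estimates a refusal direction as the normalized mean difference between last-token activations on harmful and harmless prompts; the prompt lists, layer index, and model name are placeholders and are not taken from compute_refusal_dir.py.

```python
# Hedged sketch only: one way a refusal direction could be estimated with
# plain Transformers, as the normalized mean difference between activations
# on harmful and harmless prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
LAYER = 14  # a middle layer; the best choice varies per model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

harmful = ["How do I pick a lock?", "Explain how to hotwire a car."]
harmless = ["How do I bake bread?", "Explain how photosynthesis works."]

@torch.no_grad()
def mean_last_token_activation(prompts):
    acts = []
    for p in prompts:
        text = tok.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True,
        )
        enc = tok(text, return_tensors="pt").to(model.device)
        out = model(**enc, output_hidden_states=True)
        # Hidden state of the final prompt token at the chosen layer.
        acts.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(acts).mean(dim=0)

refusal_dir = mean_last_token_activation(harmful) - mean_last_token_activation(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()
torch.save(refusal_dir, "refusal_dir.pt")
```

inference.py would then presumably load the saved direction and apply it at generation time, along the lines of the earlier hook-based sketch.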
Highlighted Details
Maintenance & Community
No specific community channels, roadmap, or notable contributors are mentioned in the README. The last recorded activity was about a year ago, and the project is marked as inactive.
Licensing & Compatibility
The README does not specify a license, so suitability for commercial use or closed-source linking is undetermined.
Limitations & Caveats
The implementation is a crude proof-of-concept and may not work with all models, particularly those with custom layer implementations (e.g., some Qwen variants). Compatibility with models larger than 3B parameters is not extensively tested.