Python library for LLM feature ablation using TransformerLens
This library provides a streamlined Python workflow for feature ablation in Large Language Models (LLMs) using TransformerLens. It's designed for researchers and practitioners who need to systematically identify and modify specific model behaviors, such as reducing harmful outputs, by caching activations and calculating "refusal directions." The primary benefit is accelerating experimental iteration and reducing boilerplate code for ablation studies.
How It Works
The library uses TransformerLens to access and cache intermediate activations from specified layers (e.g., the residual stream). It then computes "refusal directions" from the differences in activations between a "harmful" and a "harmless" dataset. Each direction is a vector in activation space that, when applied as a weight modification, aims to steer the model's behavior. The workflow emphasizes iteratively testing and applying these directions, with utilities for saving and restoring model states and activations.
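The idea of a direction computed from activation differences can be illustrated with a minimal, library-independent sketch. It assumes a difference-of-means direction and a simple projection-based ablation. The summary does not confirm that the library computes or applies directions exactly this way, and every function name below is hypothetical rather than part of the abliterator or TransformerLens APIs. Toy 3-dimensional lists stand in for cached activations.

```python
import math

# Illustrative only: these helpers are NOT the abliterator API.
# Assumption: the direction is the normalized difference of mean activations.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    harmful_mean = mean_vector(harmful_acts)
    harmless_mean = mean_vector(harmless_acts)
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("harmful and harmless means are identical")
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of one activation along the unit direction."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

def orthogonalize_weights(weight, direction):
    """Return W - d (d^T W) for a matrix that writes into activation space.

    `weight` has one row per activation dimension, so its outputs can no
    longer have any component along `direction` after this modification.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    proj = [sum(direction[i] * weight[i][j] for i in range(n_rows))
            for j in range(n_cols)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(n_cols)]
            for i in range(n_rows)]

# Toy "cached activations": two harmful and two harmless samples.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], direction)
new_weight = orthogonalize_weights(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], direction
)
```

In this toy data the two datasets differ only along the first coordinate, so the direction lies on that axis and ablation zeroes that coordinate while leaving the others unchanged. With a real model, the vectors would come from activations cached at a chosen layer rather than hand-written lists.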
Quick Start & Requirements
pip install abliterator
Highlighted Details
Maintenance & Community
The project is currently maintained solely by its author, FailSpy, who explicitly invites community contributions to improve documentation and expand functionality.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The library is described as "barebones," and its documentation is sparse. Saving as a HuggingFace model is noted as "coming soon." The library currently serves the author's personal workflow; broader utility is a future goal.