abliterator by FailSpy

Python library for LLM feature ablation using TransformerLens

Created 1 year ago

536 stars

Top 59.2% on SourcePulse


1 Expert Loves This Project

Maxime Labonne

Head of Post-Training at Liquid AI

Project Summary

This library provides a streamlined Python workflow for feature ablation in Large Language Models (LLMs) using TransformerLens. It is aimed at researchers and practitioners who need to systematically identify and modify specific model behaviors, such as refusals, by caching activations and calculating "refusal directions." Its main benefits are faster experimental iteration and less boilerplate code in ablation studies.

How It Works

The library uses TransformerLens to access and cache intermediate activations from specified layers (e.g., the residual stream). It then computes "refusal directions" from the differences in activations between "harmful" and "harmless" datasets. Each direction is a vector in activation space; applying it as a weight modification is intended to steer the model's behavior. The workflow emphasizes iteratively testing and applying these directions, with utilities for saving and restoring model states and activations.
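The idea of a direction derived from activation differences can be illustrated with a small NumPy sketch. This is not the library's code: the function names, array shapes, and synthetic data below are assumptions made for illustration, and the difference-of-means estimate and orthogonal projection are one common way to realize the description above, not necessarily what abliterator does internally.

```python
import numpy as np


def mean_difference_direction(acts_a, acts_b):
    """Return the unit vector pointing from the mean of acts_b to the mean of acts_a.

    Both inputs are (n_samples, d_model) arrays of cached activations.
    """
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction):
    """Project `direction` out of everything `weight` writes.

    `weight` has shape (d_in, d_model) and is applied as `x @ weight`.
    The result W' = W - (W d) d^T satisfies (x @ W') . d == 0 for every x.
    """
    return weight - np.outer(weight @ direction, direction)


# Synthetic stand-ins for cached activations (hypothetical data, not model output).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0  # the feature we pretend separates the two datasets

acts_harmless = rng.normal(size=(200, d_model))
acts_harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

direction = mean_difference_direction(acts_harmful, acts_harmless)

# A toy weight matrix that writes into the same activation space.
weight = rng.normal(size=(8, d_model))
ablated = ablate_direction(weight, direction)

# Largest remaining component along the direction; should be close to zero.
print(np.abs(ablated @ direction).max())
```

After the projection, no input can make the toy matrix write along the estimated direction, which is the sense in which a direction is "applied as a weight modification" in this sketch.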

Quick Start & Requirements

Install via pip: pip install abliterator
Requires Python 3.8+ and PyTorch.
GPU with CUDA is highly recommended for performance.
TransformerLens is a core dependency.
Example usage and model loading are provided in the README.

Highlighted Details

Built-in utilities for caching activations from N samples.
Integrated calculation of refusal directions.
Functions for testing and applying identified directions.
Support for custom chat templates and negative/positive tokens.
Ability to blacklist/whitelist specific layers to prevent unintended modifications.
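As a rough illustration of the layer blacklist/whitelist idea, the sketch below selects which layer indices a modification would touch. The helper `select_layers` and its parameters are hypothetical and are not taken from the library's API.

```python
def select_layers(n_layers, whitelist=None, blacklist=None):
    """Return the sorted layer indices that are eligible for modification.

    Hypothetical helper: with no whitelist, every layer is a candidate;
    blacklisted and out-of-range indices are always dropped.
    """
    candidates = range(n_layers) if whitelist is None else whitelist
    blocked = set(blacklist or ())
    return sorted(
        {layer for layer in candidates if 0 <= layer < n_layers and layer not in blocked}
    )


# Example: a 12-layer model where the first two layers and the last are protected.
layers_to_modify = select_layers(12, blacklist=[0, 1, 11])
print(layers_to_modify)
```

Treating the blacklist as taking precedence over the whitelist is a design choice made for this sketch, so that a protected layer can never be modified by accident.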

Maintenance & Community

The project is currently maintained solely by its author, FailSpy, who explicitly invites community contributions to improve the documentation and expand functionality.

Licensing & Compatibility

The README does not explicitly state a license, so suitability for commercial use or closed-source linking is unspecified.

Limitations & Caveats

The library is described as "barebones," with slim documentation. Saving the result as a HuggingFace model is noted as "coming soon." The library currently reflects the author's personal workflow; broader utility is a future goal.


Explore Similar Projects

dots.llm1 by rednote-hilab

simplified_transformers by bobby-he

prompt-lookup-decoding by apoorvumang

Auto1111SDK by Auto1111SDK

prompt-optimizer by vaibkumr

HALOs by ContextualAI

memit by kmeng01

tomesd by dbolya

blt by facebookresearch

R-KV by Zefan-Cai

LLMLingua by microsoft

RedPajama-Data by togethercomputer