abliterator by FailSpy

Python library for LLM feature ablation using TransformerLens

created 1 year ago
492 stars

Top 63.6% on sourcepulse

Project Summary

This library provides a streamlined Python workflow for feature ablation in Large Language Models (LLMs) using TransformerLens. It's designed for researchers and practitioners who need to systematically identify and modify specific model behaviors, such as reducing harmful outputs, by caching activations and calculating "refusal directions." The primary benefit is accelerating experimental iteration and reducing boilerplate code for ablation studies.

How It Works

The library uses TransformerLens to access and cache intermediate activations from specified layers (e.g., residual streams). It then computes "refusal directions" from the differences in activations between "harmful" and "harmless" datasets. Each direction is a vector in activation space; applying it as a weight modification is intended to steer the model's behavior. The workflow emphasizes iteratively testing and applying these directions, with utilities for saving and restoring model states and activations.
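The idea behind a refusal direction can be illustrated with a small sketch. It is not the library's code or API: it assumes the direction is the normalized difference of mean activations between the two datasets, and that "applying" it means projecting that direction out of a weight matrix that writes into the same activation space. The function names (`mean_vector`, `refusal_direction`, `ablate_direction`) are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed definition: unit vector along mean(harmful) - mean(harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two datasets have identical mean activations")
    return [x / norm for x in diff]

def ablate_direction(weight, direction):
    """Return W' = W - r (r^T W), so outputs of W' have no component along r.

    `weight` is a list of rows; row i produces coordinate i of the output,
    which lives in the same space as `direction`.
    """
    n_cols = len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(len(weight)))
        for j in range(n_cols)
    ]
    return [
        [weight[i][j] - direction[i] * coeffs[j] for j in range(n_cols)]
        for i in range(len(weight))
    ]

# Toy data: two cached activation vectors per dataset, two dimensions each.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[1.0, 0.0], [1.0, 0.0]]
direction = refusal_direction(harmful, harmless)

weight = [[1.0, 2.0], [3.0, 4.0]]
ablated = ablate_direction(weight, direction)
```

In this toy case the direction is the first coordinate axis, so the ablated matrix keeps its second row and zeroes its first. In a real model the same projection would presumably be applied to every matrix that writes into the residual stream, which is a design choice this sketch does not cover.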

Quick Start & Requirements

  • Install via pip: pip install abliterator
  • Requires Python 3.8+ and PyTorch.
  • GPU with CUDA is highly recommended for performance.
  • TransformerLens is a core dependency.
  • Example usage and model loading are provided in the README.

Highlighted Details

  • Built-in utilities for caching activations from N samples.
  • Integrated calculation of refusal directions.
  • Functions for testing and applying identified directions.
  • Support for custom chat templates and negative/positive tokens.
  • Ability to blacklist/whitelist specific layers to prevent unintended modifications.
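The layer blacklist/whitelist feature can be pictured as a simple filter over layer indices. The helper below is a hypothetical illustration, not a function from abliterator; it assumes layers are identified by integer index and that a blacklist takes precedence over a whitelist.

```python
def select_layers(n_layers, whitelist=None, blacklist=None):
    """Return the sorted layer indices that may be modified.

    Hypothetical semantics: with no whitelist every layer is a candidate;
    blacklisted layers are always excluded; out-of-range indices are dropped.
    """
    candidates = set(range(n_layers)) if whitelist is None else set(whitelist)
    candidates -= set(blacklist or ())
    return sorted(i for i in candidates if 0 <= i < n_layers)

# Protect the first and last layers of a six-layer model.
middle_layers = select_layers(6, blacklist=[0, 5])

# Restrict edits to an explicit set, minus one excluded layer.
chosen = select_layers(6, whitelist=[2, 3, 9], blacklist=[3])
```

Giving the blacklist priority makes it harder to modify a protected layer by accident when both lists are supplied.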

Maintenance & Community

The project is currently maintained by the sole author, FailSpy, with an explicit goal of community contribution to improve documentation and expand functionality.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library is described as "barebones," with slim documentation. Saving as a HuggingFace model is noted as "coming soon." The library currently serves the author's personal workflow; broader utility is a future goal.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 44 stars in the last 90 days
