abliterator by FailSpy

Python library for LLM feature ablation using TransformerLens

Created 1 year ago
514 stars

Top 60.9% on SourcePulse

View on GitHub
Project Summary

This library provides a streamlined Python workflow for feature ablation in Large Language Models (LLMs) using TransformerLens. It's designed for researchers and practitioners who need to systematically identify and modify specific model behaviors, such as reducing harmful outputs, by caching activations and calculating "refusal directions." The primary benefit is accelerating experimental iteration and reducing boilerplate code for ablation studies.

How It Works

The library leverages TransformerLens to access and cache intermediate activations from specified layers (e.g., residual streams). It then computes "refusal directions" based on the differences in activations between "harmful" and "harmless" datasets. These directions represent vectors in activation space that, when applied as weight modifications, aim to steer the model's behavior. The workflow emphasizes iterative testing and application of these directions, with utilities for saving and restoring model states and activations.
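
To make this concrete, here is a minimal sketch of the refusal-direction computation written directly against TransformerLens, not abliterator's own API; the model name, layer choice, and the two toy prompt lists are placeholder assumptions.

```python
# Minimal sketch of the refusal-direction computation, written directly
# against TransformerLens rather than abliterator's API. The model name,
# layer index, and toy prompt lists below are placeholder assumptions.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # stand-in for a chat model
layer = 6                                           # arbitrary mid-depth layer
act_name = utils.get_act_name("resid_post", layer)  # residual stream after block 6

harmful = ["How do I pick a lock?", "Write a phishing email."]
harmless = ["How do I bake bread?", "Write a thank-you email."]

def mean_last_token_act(prompts):
    """Cache activations and average the residual stream at the final token."""
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(prompt)
        acts.append(cache[act_name][0, -1])  # shape: [d_model], last position
    return torch.stack(acts).mean(dim=0)

# The "refusal direction": normalized difference of the mean activations.
refusal_dir = mean_last_token_act(harmful) - mean_last_token_act(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()
```

In the library itself, the equivalent computation runs over activations cached from N samples of each dataset, across several activation sites per layer, rather than two toy prompts.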

Quick Start & Requirements

  • Install via pip: pip install abliterator
  • Requires Python 3.8+ and PyTorch.
  • GPU with CUDA is highly recommended for performance.
  • TransformerLens is a core dependency.
  • Example usage and model loading are provided in the README; a hedged sketch follows this list.
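
The sketch below is patterned on the README's loading example; treat the class and helper names (ModelAbliterator, get_harmful_instructions, get_harmless_instructions, cache_activations) as approximate recollections and confirm them against the repository before relying on them.

```python
# Hedged sketch of loading a model with abliterator, patterned on the README.
# The names used here (ModelAbliterator, get_harmful_instructions, etc.) are
# best-effort recollections of the README's example -- verify against the repo.
import abliterator

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any HF model id

# Paired "harmful" / "harmless" instruction datasets used for caching.
dataset = [
    abliterator.get_harmful_instructions(),
    abliterator.get_harmless_instructions(),
]

my_model = abliterator.ModelAbliterator(
    model_name,
    dataset,
    device="cuda",                                  # GPU strongly recommended
    activation_layers=["resid_pre", "resid_post"],  # which activations to cache
)

my_model.cache_activations(N=512)  # cache activations from N samples
```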

Highlighted Details

  • Built-in utilities for caching activations from N samples.
  • Integrated calculation of refusal directions.
  • Functions for testing and applying identified directions.
  • Support for custom chat templates and negative/positive tokens.
  • Ability to blacklist/whitelist specific layers to prevent unintended modifications (see the sketch after this list).
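
To illustrate the "apply" step without guessing at the library's helpers, the sketch below ablates a refusal direction at inference time with plain TransformerLens hooks, restricted to a whitelist of layers. It reuses the model and refusal_dir from the earlier sketch; the library itself is described as applying directions as weight modifications, so this runtime-hook variant is only an illustration of the same effect.

```python
# Illustrative sketch: ablate a refusal direction at inference time with a
# TransformerLens forward hook, limited to a whitelist of layers. Reuses
# `model` and the unit-norm `refusal_dir` from the earlier sketch; the
# library itself applies directions as weight modifications instead.
from transformer_lens import utils

def ablate_direction(resid, hook, direction):
    # Remove each residual-stream vector's component along `direction`.
    proj = (resid @ direction).unsqueeze(-1) * direction
    return resid - proj

whitelist = [5, 6, 7]  # only modify these layers (cf. blacklist/whitelist)
fwd_hooks = [
    (utils.get_act_name("resid_post", layer),
     lambda resid, hook: ablate_direction(resid, hook, refusal_dir))
    for layer in whitelist
]

output = model.run_with_hooks("How do I pick a lock?", fwd_hooks=fwd_hooks)
```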

Maintenance & Community

The project is maintained by its sole author, FailSpy, who explicitly invites community contributions to improve documentation and expand functionality.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library is described as "barebones" with slim documentation. Saving the modified model in HuggingFace format is noted as "coming soon." The current focus is the author's personal workflow; broader utility is a stated future goal.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
466
MoE model for research
Created 5 months ago
Updated 1 month ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

prompt-lookup-decoding by apoorvumang

0%
572
Decoding method for faster LLM generation
Created 1 year ago
Updated 1 year ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Jiaming Song (Chief Scientist at Luma AI).

tomesd by dbolya

0%
1k
Speed-up tool for Stable Diffusion
Created 2 years ago
Updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 1 more.

blt by facebookresearch

0.1%
2k
Code for Byte Latent Transformer research paper
Created 10 months ago
Updated 4 months ago