Sparse autoencoder research code for neural network activations
This repository provides tools for training and evaluating sparse autoencoders (SAEs) on neural network activations, primarily for interpretability research. It targets researchers and practitioners working with large language models who want to understand and manipulate internal representations. The library offers a flexible framework for various SAE architectures and training protocols, along with pre-trained dictionaries for the Pythia-70m-deduped model.
How It Works
The library implements several SAE architectures (standard, Gated, TopK, BatchTopK, JumpReLU), each with a corresponding trainer. It uses an `ActivationBuffer` to efficiently collect and batch activations from specified model submodules via the `nnsight` library. Training protocols include options for L1 regularization, neuron resampling, learning rate warmup/decay, and sparsity penalty warmup. Activations can be normalized for better hyperparameter transfer.
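
For concreteness, here is a minimal training sketch along the lines of the repository's README example: an `nnsight` `LanguageModel` wraps the target model, an `ActivationBuffer` streams activations from one submodule, and `trainSAE` fits a dictionary on them. Exact argument and config-key names (e.g., `n_ctxs`, the trainer-config fields, any required step counts) vary across versions, so treat the names below as indicative rather than definitive.

```python
from nnsight import LanguageModel
from dictionary_learning import ActivationBuffer, AutoEncoder
from dictionary_learning.trainers import StandardTrainer
from dictionary_learning.training import trainSAE

model = LanguageModel("EleutherAI/pythia-70m-deduped", device_map="cuda:0")
submodule = model.gpt_neox.layers[1].mlp  # submodule to collect activations from
activation_dim = 512                      # output dimension of that submodule
dictionary_size = 16 * activation_dim     # number of dictionary features

# any iterator over strings works here; substitute a real text corpus
data = iter(["replace me with real training text"] * 100_000)

# buffers activations from the submodule and yields them in batches
buffer = ActivationBuffer(
    data,
    model,
    submodule,
    d_submodule=activation_dim,
    n_ctxs=30_000,   # number of contexts held in the buffer at once
    device="cuda:0",
)

# a standard SAE trained with an L1 sparsity penalty
trainer_cfg = {
    "trainer": StandardTrainer,
    "dict_class": AutoEncoder,
    "activation_dim": activation_dim,
    "dict_size": dictionary_size,
    "lr": 1e-3,
    "device": "cuda:0",
}

ae = trainSAE(data=buffer, trainer_configs=[trainer_cfg])
```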
Quick Start & Requirements
```bash
pip install dictionary-learning
```
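
Once installed, a dictionary behaves like an ordinary PyTorch module. The sketch below uses random tensors in place of real activations, and the commented `from_pretrained` line is a placeholder for loading the released Pythia-70m-deduped dictionaries (assuming that loader matches your installed version).

```python
import torch
from dictionary_learning import AutoEncoder

activation_dim = 512                   # dimension of the activations being autoencoded
dictionary_size = 16 * activation_dim  # number of features in the dictionary

ae = AutoEncoder(activation_dim, dictionary_size)
# or load released weights (placeholder path):
# ae = AutoEncoder.from_pretrained("path/to/dictionary/weights", device="cuda:0")

activations = torch.randn(64, activation_dim)  # stand-in for real model activations
features = ae.encode(activations)              # sparse feature activations
reconstruction = ae.decode(features)           # reconstructed activations

# or get both in one forward pass
reconstruction, features = ae(activations, output_features=True)
```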
Highlighted Details
- `ActivationBuffer` for efficient data handling (see the sketch below).
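
As an illustration of that buffering behavior, a sketch reusing the `buffer` from the training example above (and assuming the buffer's iterator interface, which may differ by version):

```python
# drawing from the buffer yields a batch of activations; the buffer
# re-runs the model over fresh text to refill itself as it drains
acts = next(buffer)
print(acts.shape)  # (batch_size, activation_dim)
```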
Maintenance & Community

- The `nnsight` package is under active development and may have breaking changes.

Licensing & Compatibility
Limitations & Caveats
- `nnsight` is under active development, potentially leading to breaking changes.
- `sae_lens` compatibility is limited (currently only JumpReLU).