Sparse autoencoder for GPT2-small activation analysis
This repository provides sparse autoencoders (SAEs) trained on GPT-2 small model activations, along with a visualizer for exploring learned features. It is intended for researchers and practitioners interested in interpretability and understanding the internal workings of large language models. The project offers pre-trained SAEs and code for training and visualizing them, enabling deeper insights into how LLMs represent information.
How It Works
The project implements sparse autoencoders, a type of neural network that learns compressed representations of its input while promoting sparsity in the hidden-layer activations. This sparsity encourages individual latent units to specialize in detecting specific features or concepts within the model's activations. The architecture is defined in model.py, with training code in train.py.
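The following is a minimal sketch of the idea, not the repository's actual model.py: it assumes a TopK sparsity mechanism, and the class name, dimensions, and value of k are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: encode, keep only k active latents, decode."""

    def __init__(self, d_model: int, n_latents: int, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.k = k  # number of latents allowed to fire per input

    def forward(self, x):
        pre = self.encoder(x)
        # TopK sparsity: zero out everything except the k largest pre-activations
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, F.relu(topk.values))
        return self.decoder(latents), latents

# Reconstruction loss pushes the latents to capture the activation's content,
# while the sparsity constraint forces individual latents to specialize.
sae = SparseAutoencoder(d_model=768, n_latents=32768, k=32)
acts = torch.randn(8, 768)  # stand-in for GPT-2 small activations
recon, latents = sae(acts)
loss = F.mse_loss(recon, acts)
```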
Quick Start & Requirements
pip install git+https://github.com/openai/sparse_autoencoder.git
Additional dependencies: transformer_lens and blobfile. Example usage with transformer_lens is provided in the README.
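As a rough sketch of the workflow, GPT-2 small activations can be captured with transformer_lens and passed to a downloaded autoencoder. The hook point, layer, and the commented load_sae call and checkpoint path below are hypothetical placeholders; consult the README for the actual checkpoint locations and loading code.

```python
import torch
from transformer_lens import HookedTransformer

# Capture GPT-2 small activations with transformer_lens
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Sparse autoencoders decompose activations into features.")
_, cache = model.run_with_cache(tokens)

# Residual-stream activations at layer 6; match the layer and hook location
# to the specific SAE checkpoint you download.
acts = cache["blocks.6.hook_resid_post"]  # [batch, seq, d_model=768]

# Hypothetical loading step; see the README for the real paths and loader:
# sae = load_sae("path/to/gpt2-small/sae-checkpoint")
# latents = sae.encode(acts.reshape(-1, acts.shape[-1]))
```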
Highlighted Details
Includes a feature visualizer (sae-viewer) for exploring learned representations.
Maintenance & Community
This project is from OpenAI. Further community engagement details are not specified in the README.
Licensing & Compatibility
The repository does not explicitly state a license.
Limitations & Caveats
The project focuses specifically on GPT-2 small activations; compatibility with other models or architectures is not guaranteed. The README does not detail performance benchmarks or specific limitations of the autoencoders themselves.