Sparse autoencoder for GPT2-small activation analysis
This repository provides sparse autoencoders (SAEs) trained on GPT-2 small model activations, along with a visualizer for exploring learned features. It is intended for researchers and practitioners interested in interpretability and understanding the internal workings of large language models. The project offers pre-trained SAEs and code for training and visualizing them, enabling deeper insights into how LLMs represent information.
How It Works
The project implements sparse autoencoders, a type of neural network that learns compressed representations of its input while promoting sparsity in the hidden-layer activations. This sparsity encourages individual latent units to specialize in detecting specific features or concepts within the model's activations. The architecture is defined in model.py, with training code in train.py.
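The following is a minimal sketch of the idea, not the repository's actual model.py: it assumes a TopK sparsity mechanism, and the class name, dimensions, and value of k are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: encode, keep only k active latents, decode."""

    def __init__(self, d_model: int, n_latents: int, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.k = k  # number of latents allowed to fire per input

    def forward(self, x):
        pre = self.encoder(x)
        # TopK sparsity: zero out everything except the k largest pre-activations
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, F.relu(topk.values))
        return self.decoder(latents), latents

# Reconstruction loss pushes the latents to capture the activation's content,
# while the sparsity constraint forces individual latents to specialize.
sae = SparseAutoencoder(d_model=768, n_latents=32768, k=32)
acts = torch.randn(8, 768)  # stand-in for GPT-2 small activations
recon, latents = sae(acts)
loss = F.mse_loss(recon, acts)
```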
Quick Start & Requirements
pip install git+https://github.com/openai/sparse_autoencoder.git
Additional dependencies: transformer_lens and blobfile. Example usage with transformer_lens is provided in the README.
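As a rough sketch of the workflow, GPT-2 small activations can be captured with transformer_lens and passed to a downloaded autoencoder. The hook point, layer, and the commented load_sae call and checkpoint path below are hypothetical placeholders; consult the README for the actual checkpoint locations and loading code.

```python
import torch
from transformer_lens import HookedTransformer

# Capture GPT-2 small activations with transformer_lens
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Sparse autoencoders decompose activations into features.")
_, cache = model.run_with_cache(tokens)

# Residual-stream activations at layer 6; match the layer and hook location
# to the specific SAE checkpoint you download.
acts = cache["blocks.6.hook_resid_post"]  # [batch, seq, d_model=768]

# Hypothetical loading step; see the README for the real paths and loader:
# sae = load_sae("path/to/gpt2-small/sae-checkpoint")
# latents = sae.encode(acts.reshape(-1, acts.shape[-1]))
```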
Highlighted Details
Includes a feature visualizer (sae-viewer) for exploring learned representations.
Maintenance & Community
This project is from OpenAI. Further community engagement details are not specified in the README.
Licensing & Compatibility
The repository does not explicitly state a license.
Limitations & Caveats
The project focuses specifically on GPT-2 small activations; compatibility with other models or architectures is not guaranteed. The README does not detail performance benchmarks or specific limitations of the autoencoders themselves.