sparse_autoencoder by openai

Sparse autoencoder for GPT2-small activation analysis

created 1 year ago
505 stars

Top 62.5% on sourcepulse

Project Summary

This repository provides sparse autoencoders (SAEs) trained on GPT-2 small activations, along with a visualizer for exploring the learned features. It is aimed at researchers and practitioners working on interpretability who want to understand the internals of language models. The project offers pre-trained SAEs plus code for training and visualizing them, so the features a model represents in its activations can be inspected directly.

How It Works

The project implements sparse autoencoders: neural networks trained to reconstruct their input through a hidden layer whose activations are encouraged to be sparse. Here the input is a GPT-2 small activation vector, and the sparsity pressure encourages each hidden unit (latent) to specialize in a specific feature or concept present in those activations. The architecture is defined in model.py, and the training code is in train.py.
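
To make the encode/decode idea concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss. It is not taken from model.py or train.py: the class name TinySparseAutoencoder, the ReLU encoder, the L1 sparsity penalty, the l1_coef value, and the tied initialization are all assumptions chosen for illustration. It uses plain Python lists instead of PyTorch so it runs with no dependencies.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySparseAutoencoder:
    """Hypothetical toy SAE: ReLU encoder, linear decoder (illustration only)."""

    def __init__(self, d_in, n_latents, seed=0):
        rng = random.Random(seed)
        # Encoder weights: one row per latent, one column per input dimension.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)]
                      for _ in range(n_latents)]
        self.b_enc = [0.0] * n_latents
        # Decoder starts as the transpose of the encoder (an assumed choice).
        self.w_dec = [[self.w_enc[j][i] for j in range(n_latents)]
                      for i in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        # Subtract the decoder bias, project, add the encoder bias, apply ReLU.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.w_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, latents):
        # Linear reconstruction of the input from the latent activations.
        return [r + b for r, b in zip(matvec(self.w_dec, latents), self.b_dec)]

    def loss(self, x, l1_coef=1e-3):
        latents = self.encode(x)
        recon = self.decode(latents)
        mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
        l1 = sum(latents)  # latents are non-negative after ReLU
        return mse + l1_coef * l1, latents


# Stand-in for a model activation vector (real GPT-2 small activations are 768-d).
sae = TinySparseAutoencoder(d_in=4, n_latents=16)
total, latents = sae.loss([0.5, -1.0, 0.25, 2.0])
active = sum(1 for z in latents if z > 0)
print(f"loss={total:.4f}, active latents={active}/{len(latents)}")
```

Training would minimize this loss over many activation vectors. The reconstruction term keeps the latents informative, while the sparsity term pushes most of them to zero for any given input.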

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/openai/sparse_autoencoder.git
  • Prerequisites: Python, PyTorch, transformer_lens, blobfile.
  • Example usage and detailed integration with transformer_lens are provided in the README.

Highlighted Details

  • Pre-trained SAEs available for GPT-2 small activations.
  • Feature visualization tool (sae-viewer) for exploring learned representations.
  • Codebase includes model architecture, training, and utility scripts.

Maintenance & Community

This project is from OpenAI. Further community engagement details are not specified in the README.

Licensing & Compatibility

The repository does not explicitly state a license.

Limitations & Caveats

The project focuses specifically on GPT-2 small activations; compatibility with other models or architectures is not guaranteed. The README does not detail performance benchmarks or specific limitations of the autoencoders themselves.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 49 stars in the last 90 days