sparse_coding  by HoagyC

Sparse coding for interpretable language model features

Created 2 years ago
269 stars

Top 95.5% on SourcePulse

Project Summary

This repository provides code for applying sparse coding to activation vectors in language models, enabling the discovery of interpretable features. It is targeted at researchers and practitioners interested in understanding the internal representations of neural networks. The primary benefit is the ability to train and interpret sparse autoencoders, offering insights into model behavior.

How It Works

The project trains sparse autoencoders on language model activations to learn dictionaries of sparsely activating features. It supports training multiple autoencoders simultaneously, each with a different L1 sparsity coefficient, on a single GPU or across several. The interpret.py script uses OpenAI's automatic interpretation protocol to score the learned dictionaries and to compare them against baselines such as PCA, ICA, NMF, and random projections.
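As a rough illustration of the objective described above, the sketch below evaluates a reconstruction-plus-L1 loss for several L1 coefficients at once. It is a minimal NumPy sketch, not the repository's code: the function name `sae_loss`, the parameter names, the tied initialization, and the random stand-in activations are all assumptions made for illustration.

```python
import numpy as np


def sae_loss(x, W_enc, b_enc, W_dec, l1_coef):
    """Mean squared reconstruction error plus an L1 penalty on the codes."""
    codes = np.maximum(x @ W_enc + b_enc, 0.0)    # ReLU encoder: (batch, n_features)
    x_hat = codes @ W_dec                         # linear decoder: (batch, d_model)
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.sum(np.abs(codes), axis=1))   # average L1 norm per example
    return mse + l1_coef * l1, codes


rng = np.random.default_rng(0)
d_model, n_features, batch = 16, 64, 32           # hypothetical sizes
x = rng.normal(size=(batch, d_model))             # stand-in for LM activations

# Assumption: every autoencoder starts from the same tied initialization.
W_init = rng.normal(scale=0.1, size=(d_model, n_features))
shared_init = {"W_enc": W_init, "b_enc": np.zeros(n_features), "W_dec": W_init.T}

# One independent parameter set per L1 coefficient, all fed the same batch.
l1_values = [0.0, 1e-3, 1e-2]
params = {l1: {k: v.copy() for k, v in shared_init.items()} for l1 in l1_values}
losses = {l1: sae_loss(x, l1_coef=l1, **params[l1])[0] for l1 in l1_values}
```

In a real run each parameter set would be updated by its own optimizer step on every batch, so a single pass over the activation data trains all sparsity settings together.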

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.x, PyTorch, NumPy, SciPy, Pandas, Matplotlib, Hugging Face Transformers, OpenAI API key (for automatic interpretation). GPU recommended for training.
  • Resources: Training can be resource-intensive, especially for large models and datasets.
  • Links: Paper

Highlighted Details

  • Simultaneous training of multiple sparse autoencoders with different L1 values.
  • Automatic interpretation of learned dictionaries using OpenAI's protocol.
  • Comparison of learned features against various baseline methods (PCA, ICA, NMF, etc.).
  • Code used for results in the paper "Sparse Autoencoders Find Highly Interpretable Features in Language Models."
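To make the baseline comparison more concrete, here is a hedged sketch of how two of the listed baselines, PCA and random directions, could be expressed as dictionaries with the same shape as a learned one. The function names and the synthetic activations are hypothetical and do not come from the repository.

```python
import numpy as np


def pca_dictionary(acts, n_components):
    """Top principal directions of the activations, one direction per row."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


def random_dictionary(d_model, n_components, seed=0):
    """Random unit-norm directions, one direction per row."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_components, d_model))
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)


def feature_activations(acts, dictionary):
    """Project activations onto each dictionary direction."""
    return acts @ dictionary.T


acts = np.random.default_rng(1).normal(size=(200, 16))  # stand-in activations
pca = pca_dictionary(acts, n_components=8)
rand = random_dictionary(d_model=16, n_components=8)
pca_feats = feature_activations(acts, pca)               # shape (200, 8)
```

Putting every method in the same (n_features, d_model) layout means the same interpretation scoring can be run over each dictionary without method-specific handling.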

Maintenance & Community

The project is associated with Logan Riggs, Aidan Ewart, and Lee Sharkey. For further development, the sparse_autoencoder library is recommended.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The project is primarily research code and may require adaptation for production environments. For easier use and closer adherence to best practices, the sparse_autoencoder library is recommended instead.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
