Sparse coding for interpretable language model features
This repository provides code for applying sparse coding to activation vectors in language models in order to discover interpretable features. It targets researchers and practitioners who want to understand the internal representations of neural networks, and its main offering is tooling to train and interpret sparse autoencoders for insight into model behavior.
How It Works
The project trains sparse autoencoders on language model activations to learn overcomplete dictionaries of sparsely activating features. Multiple autoencoders with different L1 sparsity coefficients can be trained simultaneously, on a single GPU or across several. The interpret.py script applies OpenAI's automatic interpretation protocol to score the learned dictionaries against baseline methods such as PCA, ICA, NMF, and random projections.
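As a minimal sketch of the technique (not the repository's actual implementation; the class, dimensions, and hyperparameters below are illustrative assumptions), a sparse autoencoder of this kind pairs a reconstruction loss with an L1 penalty on the feature activations:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Illustrative sparse autoencoder: an overcomplete dictionary is learned
    # by reconstructing activations through a ReLU bottleneck.
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(codes)          # reconstructed activation vector
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff):
    # Reconstruction error plus an L1 penalty that drives most codes to zero.
    mse = torch.mean((recon - x) ** 2)
    return mse + l1_coeff * codes.abs().mean()

Sweeping l1_coeff over several values and training one autoencoder per value is what "varying L1 sparsity values" refers to: higher coefficients yield sparser, often more interpretable, features at some cost in reconstruction quality.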
Quick Start & Requirements
Install the Python dependencies from the repository root:
pip install -r requirements.txt
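Training requires a dataset of activation vectors. A hedged sketch of collecting hidden-state activations with Hugging Face transformers follows; the model name and layer index are illustrative assumptions, and the repository's own scripts define its actual data pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

inputs = tok("Sparse coding finds interpretable features.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer (plus the embeddings);
# each tensor has shape (batch, seq_len, d_model).
acts = out.hidden_states[3].reshape(-1, model.config.hidden_size)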
Maintenance & Community
The project is associated with Logan Riggs, Aidan Ewart, and Lee Sharkey. For further development, the authors recommend building on the sparse_autoencoder library instead.
Licensing & Compatibility
The repository does not explicitly state a license. Verify the licensing terms before any commercial or closed-source use.
Limitations & Caveats
The project is primarily research code and may require adaptation for production use. The sparse_autoencoder library is recommended instead for ease of use and adherence to current best practices, which suggests this codebase may continue to change or receive only limited maintenance.
Last updated about one year ago; the repository is inactive.