Sparse coding for interpretable language model features
This repository provides code for applying sparse coding to activation vectors in language models in order to discover interpretable features. It targets researchers and practitioners who want to understand the internal representations of neural networks, and its main offering is tooling to train and interpret sparse autoencoders for insight into model behavior.
How It Works
The project trains sparse autoencoders on language model activations to learn overcomplete dictionaries of sparsely activating features. Multiple autoencoders with different L1 sparsity coefficients can be trained simultaneously, on a single GPU or across several. The interpret.py script applies OpenAI's automatic interpretation protocol to score the learned dictionaries against baseline methods such as PCA, ICA, NMF, and random projections.
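As a minimal sketch of the technique (not the repository's actual implementation; the class, dimensions, and hyperparameters below are illustrative assumptions), a sparse autoencoder of this kind pairs a reconstruction loss with an L1 penalty on the feature activations:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Illustrative sparse autoencoder: an overcomplete dictionary is learned
    # by reconstructing activations through a ReLU bottleneck.
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(codes)          # reconstructed activation vector
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff):
    # Reconstruction error plus an L1 penalty that drives most codes to zero.
    mse = torch.mean((recon - x) ** 2)
    return mse + l1_coeff * codes.abs().mean()

Sweeping l1_coeff over several values and training one autoencoder per value is what "varying L1 sparsity values" refers to: higher coefficients yield sparser, often more interpretable, features at some cost in reconstruction quality.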
Quick Start & Requirements
Install the Python dependencies from the repository root:
pip install -r requirements.txt
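Training requires a dataset of activation vectors. A hedged sketch of collecting hidden-state activations with Hugging Face transformers follows; the model name and layer index are illustrative assumptions, and the repository's own scripts define its actual data pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

inputs = tok("Sparse coding finds interpretable features.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer (plus the embeddings);
# each tensor has shape (batch, seq_len, d_model).
acts = out.hidden_states[3].reshape(-1, model.config.hidden_size)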
Maintenance & Community
The project is associated with Logan Riggs, Aidan Ewart, and Lee Sharkey. For further development, the authors recommend building on the sparse_autoencoder library instead.
Licensing & Compatibility
The repository does not explicitly state a license. Verify the licensing terms before any commercial or closed-source use.
Limitations & Caveats
The project is primarily research code and may require adaptation for production use. The sparse_autoencoder library is recommended instead for ease of use and adherence to current best practices, which suggests this codebase may continue to change or receive only limited maintenance.
Last updated about one year ago; the repository is inactive.