End-to-end pipeline for LLM interpretability using Llama 3 and sparse autoencoders
This project provides a complete, reproducible pipeline for interpreting Large Language Models (LLMs) using Sparse Autoencoders (SAEs) with Llama 3.2. It targets researchers and engineers interested in mechanistic interpretability, offering tools that disentangle complex neural representations into distinct, understandable concepts for tasks such as understanding model behavior and detecting hallucinations.
How It Works
The pipeline trains Sparse Autoencoders (SAEs) that project LLM activations into a large, sparsely activated latent space, aiming to disentangle representations superimposed within individual neurons into distinct, monosemantic features. It covers capturing residual-stream activations from Llama 3.2, preprocessing those activations, training SAEs with an auxiliary loss that prevents dead latents, and analyzing the learned features by identifying the sentences that maximally activate specific latents.
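As a concrete illustration of this design, the sketch below shows a minimal top-k SAE in PyTorch. Everything here is an assumption for illustration rather than the repository's implementation: the `SparseAutoencoder` class, the top-k sparsity mechanism, and the simplified auxiliary term (which lets a wider set of latents reconstruct the leftover error so rarely used latents keep receiving gradient) stand in for whatever architecture and loss the repo actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal top-k SAE sketch: encode into a wide latent space,
    keep only the k largest latents per token, then reconstruct."""

    def __init__(self, d_model: int, d_latent: int, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x: torch.Tensor):
        pre_acts = self.encoder(x)
        # Enforce sparsity: zero out everything except the top-k latents.
        top = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts).scatter_(
            -1, top.indices, F.relu(top.values)
        )
        return self.decoder(latents), latents, pre_acts

def sae_loss(sae, x, recon, pre_acts, aux_k=256, aux_coef=1 / 32):
    # Main objective: reconstruct the captured activation.
    mse = F.mse_loss(recon, x)
    # Simplified auxiliary term (illustrative formulation only): let a
    # wider top-aux_k set of latents reconstruct the leftover error, so
    # rarely used latents still receive gradient and do not die.
    residual = (x - recon).detach()
    aux_top = torch.topk(pre_acts, aux_k, dim=-1)
    aux_latents = torch.zeros_like(pre_acts).scatter_(
        -1, aux_top.indices, F.relu(aux_top.values)
    )
    aux_mse = F.mse_loss(sae.decoder(aux_latents), residual)
    return mse + aux_coef * aux_mse
```

A training step then reduces to computing `recon, latents, pre_acts = sae(x)` on a batch of captured activations and backpropagating `sae_loss(sae, x, recon, pre_acts)`.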
Quick Start & Requirements
```bash
curl -sSL https://install.python-poetry.org | python3.12 -
git clone https://github.com/PaulPauls/llama3_interpretability_sae
cd llama3_interpretability_sae && poetry install --sync
```
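After installation, the first stage of the pipeline captures residual-stream activations from Llama 3.2. As a rough illustration of what that stage does (not the repository's actual scripts), activations can be pulled from a decoder layer with a standard PyTorch forward hook via the Hugging Face transformers API; the model ID and layer index below are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID and layer index are illustrative; the repo's capture scripts
# choose their own checkpoints and hook points.
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

captured = []

def capture_hook(module, inputs, output):
    # Llama decoder layers return a tuple; element 0 is the residual-stream
    # hidden state after the layer.
    captured.append(output[0].detach().cpu())

# Register on a middle decoder layer of the Llama architecture.
handle = model.model.layers[8].register_forward_hook(capture_hook)

with torch.no_grad():
    batch = tokenizer("The quick brown fox jumps over the lazy dog.",
                      return_tensors="pt")
    model(**batch)

handle.remove()
activations = torch.cat(captured, dim=0)  # (batch, seq_len, d_model)
```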
Highlighted Details
Maintenance & Community
This is a non-profit side project under ongoing development. Contributions and feedback are welcome. Links to community channels are not specified in the README.
Licensing & Compatibility
The project is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The project is at version 0.2 and not final. While SAE training is robust, feature steering is noted as not particularly strong in this release and produces inconsistent results. The training dataset is also smaller than those used in comparable academic research.