llama3_interpretability_sae by PaulPauls

End-to-end pipeline for LLM interpretability using Llama 3 and sparse autoencoders

  • Created 8 months ago
  • 620 stars
  • Top 54.0% on sourcepulse

Project Summary

This project provides a complete, reproducible pipeline for interpreting Large Language Models (LLMs) using Sparse Autoencoders (SAEs) with Llama 3.2. It targets researchers and engineers interested in mechanistic interpretability, offering tools to untangle complex neural representations into distinct, understandable concepts, aiding in tasks like understanding model behavior and detecting hallucinations.

How It Works

The project implements Sparse Autoencoders (SAEs) by projecting LLM activations into a large, sparsely activated latent space. This approach aims to disentangle superimposed representations within neurons into distinct, monosemantic features. The pipeline includes capturing residual activations from Llama 3.2, preprocessing this data, training SAEs with an auxiliary loss to prevent dead neurons, and analyzing learned features by identifying sentences that maximally activate specific latents.
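The core idea above — projecting activations into a wide, sparsely activated latent space and reconstructing them — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual implementation; the dimensions, the top-k sparsity mechanism, and the class name are assumptions for the sake of the example (the repo also adds an auxiliary loss for dead latents, which is omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode d_model activations into a much larger
    latent space, keep only the top-k latents active, then reconstruct."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k  # number of latents allowed to fire per input

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)
        # Enforce sparsity: keep the top-k pre-activations, zero the rest
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(
            -1, topk.indices, F.relu(topk.values)
        )
        recon = self.decoder(latents)
        return recon, latents

# Toy dimensions; the real pipeline uses Llama 3.2 residual activations
sae = SparseAutoencoder(d_model=2048, d_latent=65536, k=64)
x = torch.randn(4, 2048)           # stand-in for captured activations
recon, latents = sae(x)
loss = F.mse_loss(recon, x)        # reconstruction objective
```

Each latent then becomes a candidate "feature": the analysis stage looks for the sentences whose activations drive a given latent hardest.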

Quick Start & Requirements

  • Install Poetry: curl -sSL https://install.python-poetry.org | python3.12 -
  • Clone repository: git clone https://github.com/PaulPauls/llama3_interpretability_sae
  • Install dependencies: cd llama3_interpretability_sae && poetry install --sync
  • Requires Python 3.12, PyTorch, and Weights & Biases for logging.
  • Pre-captured activations (3.2TB) and a trained SAE model are available for download.

Highlighted Details

  • Pure PyTorch implementation of Llama 3.1/3.2 inference without external dependencies.
  • Sentence-level activation capture from a custom OpenWebText dataset variant.
  • SAE training utilizes an auxiliary loss and gradient projection for stability and to revive dead latents.
  • Interpretability analysis uses Claude 3.5 to semantically analyze sentences that maximally activate latents.
  • Feature steering is demonstrated via a Gradio interface, showing potential for manipulating model output.
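Feature steering of the kind demonstrated in the Gradio interface typically means nudging the residual stream along one latent's decoder direction. The helper below is a hypothetical sketch of that idea, not the project's actual API; the function name, shapes, and strength value are assumptions.

```python
import torch

def steer(activations: torch.Tensor, decoder_weight: torch.Tensor,
          latent_idx: int, strength: float) -> torch.Tensor:
    """Nudge residual-stream activations along one learned feature
    direction (hypothetical helper, not the repo's actual interface)."""
    # For an nn.Linear(d_latent, d_model) decoder, weight has shape
    # (d_model, d_latent), so column `latent_idx` is the feature direction.
    direction = decoder_weight[:, latent_idx]
    return activations + strength * direction

# Toy usage: push a batch of activations along latent 3
W = torch.randn(2048, 4096)        # stand-in decoder weight
acts = torch.randn(4, 2048)        # stand-in residual activations
steered = steer(acts, W, latent_idx=3, strength=5.0)
```

Injecting the steered activations back into the forward pass is what biases the model's output toward (or away from) the concept that latent encodes.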

Maintenance & Community

This is a non-profit side project with ongoing development. Contributions and feedback are welcomed. Links to community channels are not specified in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is at version 0.2 and not yet final. While SAE training is robust, feature steering is not particularly strong in this release and produces inconsistent results. The dataset is also smaller than those used in comparable academic research.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
