llama3_interpretability_sae by PaulPauls

End-to-end pipeline for LLM interpretability using Llama 3 and sparse autoencoders

  • Created 8 months ago
  • 620 stars
  • Top 54.0% on sourcepulse

Project Summary

This project provides a complete, reproducible pipeline for interpreting Large Language Models (LLMs) using Sparse Autoencoders (SAEs) with Llama 3.2. It targets researchers and engineers interested in mechanistic interpretability, offering tools to untangle complex neural representations into distinct, understandable concepts, aiding in tasks like understanding model behavior and detecting hallucinations.

How It Works

The project implements Sparse Autoencoders (SAEs) by projecting LLM activations into a large, sparsely activated latent space. This approach aims to disentangle superimposed representations within neurons into distinct, monosemantic features. The pipeline includes capturing residual activations from Llama 3.2, preprocessing this data, training SAEs with an auxiliary loss to prevent dead neurons, and analyzing learned features by identifying sentences that maximally activate specific latents.
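The core idea above — projecting activations into a wide, sparsely activated latent space and reconstructing them — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual implementation; the dimensions, the top-k sparsity mechanism, and the class name are assumptions for the sake of the example (the repo also adds an auxiliary loss for dead latents, which is omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode d_model activations into a much larger
    latent space, keep only the top-k latents active, then reconstruct."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k  # number of latents allowed to fire per input

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)
        # Enforce sparsity: keep the top-k pre-activations, zero the rest
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(
            -1, topk.indices, F.relu(topk.values)
        )
        recon = self.decoder(latents)
        return recon, latents

# Toy dimensions; the real pipeline uses Llama 3.2 residual activations
sae = SparseAutoencoder(d_model=2048, d_latent=65536, k=64)
x = torch.randn(4, 2048)           # stand-in for captured activations
recon, latents = sae(x)
loss = F.mse_loss(recon, x)        # reconstruction objective
```

Each latent then becomes a candidate "feature": the analysis stage looks for the sentences whose activations drive a given latent hardest.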

Quick Start & Requirements

  • Install Poetry: curl -sSL https://install.python-poetry.org | python3.12 -
  • Clone repository: git clone https://github.com/PaulPauls/llama3_interpretability_sae
  • Install dependencies: cd llama3_interpretability_sae && poetry install --sync
  • Requires Python 3.12, PyTorch, and Weights & Biases for logging.
  • Pre-captured activations (3.2TB) and a trained SAE model are available for download.

Highlighted Details

  • Pure PyTorch implementation of Llama 3.1/3.2 inference without external dependencies.
  • Sentence-level activation capture from a custom OpenWebText dataset variant.
  • SAE training utilizes an auxiliary loss and gradient projection for stability and to revive dead latents.
  • Interpretability analysis uses Claude 3.5 to semantically analyze sentences that maximally activate latents.
  • Feature steering is demonstrated via a Gradio interface, showing potential for manipulating model output.
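Feature steering of the kind demonstrated in the Gradio interface typically means nudging the residual stream along one latent's decoder direction. The helper below is a hypothetical sketch of that idea, not the project's actual API; the function name, shapes, and strength value are assumptions.

```python
import torch

def steer(activations: torch.Tensor, decoder_weight: torch.Tensor,
          latent_idx: int, strength: float) -> torch.Tensor:
    """Nudge residual-stream activations along one learned feature
    direction (hypothetical helper, not the repo's actual interface)."""
    # For an nn.Linear(d_latent, d_model) decoder, weight has shape
    # (d_model, d_latent), so column `latent_idx` is the feature direction.
    direction = decoder_weight[:, latent_idx]
    return activations + strength * direction

# Toy usage: push a batch of activations along latent 3
W = torch.randn(2048, 4096)        # stand-in decoder weight
acts = torch.randn(4, 2048)        # stand-in residual activations
steered = steer(acts, W, latent_idx=3, strength=5.0)
```

Injecting the steered activations back into the forward pass is what biases the model's output toward (or away from) the concept that latent encodes.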

Maintenance & Community

This is a non-profit side project with ongoing development. Contributions and feedback are welcomed. Links to community channels are not specified in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is at version 0.2 and not yet final. While SAE training is robust, feature steering is not particularly strong in this release and produces inconsistent results. The dataset is also smaller than those used in comparable academic research.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
