delphi by EleutherAI

Automated interpretability for LLMs

Created 1 year ago

253 stars

Top 99.3% on SourcePulse

Project Summary

Delphi provides automated interpretability tools for large language models. It generates and scores explanations for sparse autoencoder (SAE) and transcoder features, letting researchers and engineers analyze millions of learned features to understand a model's internal workings and behavior.

How It Works

The library automates the interpretation of learned features within LLMs. It first caches model activations over a large token set. Explanations are then generated by a configurable explainer model, run either locally via vLLM or remotely through the OpenRouter API. A ContrastiveExplainer gives the explainer both activating and non-activating examples. To improve explanation quality, FAISS provides efficient semantic similarity search for selecting "hard negative" examples: non-activating texts that closely resemble the activating ones, which pushes explanations to be specific to what actually activates the feature. Finally, the generated explanations are scored with metrics such as detection, recall, and fuzzing.
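The idea behind hard negatives can be illustrated with a small sketch. This is not Delphi's implementation: it swaps FAISS for a brute-force cosine similarity in plain Python, and the function name hard_negatives, the centroid-based ranking, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_negatives(activating, non_activating, k=2):
    """Pick the k non-activating texts most similar to the activating examples.

    Both arguments are lists of (text, embedding) pairs. Similarity is measured
    against the mean embedding of the activating examples (an assumption; a real
    pipeline would query a vector index such as FAISS instead).
    """
    dim = len(activating[0][1])
    centroid = [
        sum(emb[i] for _, emb in activating) / len(activating) for i in range(dim)
    ]
    ranked = sorted(
        non_activating, key=lambda item: cosine(item[1], centroid), reverse=True
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings, invented for this example.
activating = [("the cat sat", [1.0, 0.1]), ("a cat slept", [0.9, 0.2])]
non_activating = [
    ("the dog sat", [0.8, 0.3]),
    ("stock prices fell", [0.0, 1.0]),
    ("a dog slept", [0.7, 0.6]),
]

negatives = hard_negatives(activating, non_activating, k=2)
print(negatives)
```

The two animal sentences are selected over the unrelated finance sentence, so an explanation such as "sentences about animals" would fail on these negatives while "mentions of cats" would not.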

Quick Start & Requirements

Install the library as a local editable package by running pip install -e . from the project directory. A typical workflow caches activations (e.g., 10M tokens from EleutherAI/SmolLM2-135M-10B), generates explanations for the specified model hookpoints and features, and then scores them. The command-line interface runs the default pipeline in a single command:

    python -m delphi EleutherAI/pythia-160m EleutherAI/Pythia-160m-SST-k32-32k --n_tokens 10_000_000 --max_latents 100 --hookpoints layers.5.mlp --scorers detection --filter_bos --name llama-3-8B

Programmatic usage is also supported. To reproduce experiments from the associated article, use the article_version branch.
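When the pipeline is launched from a script, it can help to assemble the command as an argument list rather than a single string. The sketch below uses only the flags that appear in the command above; the helper name build_delphi_command and its parameter names are hypothetical, and the actual launch is left commented out because it requires Delphi to be installed.

```python
import shlex


def build_delphi_command(model, sparse_model, n_tokens, max_latents,
                         hookpoint, scorer, name):
    """Assemble the CLI call shown above as a list of arguments.

    Hypothetical helper: it only mirrors flags quoted in this document.
    """
    return [
        "python", "-m", "delphi", model, sparse_model,
        "--n_tokens", f"{n_tokens:_}",  # renders 10000000 as 10_000_000
        "--max_latents", str(max_latents),
        "--hookpoints", hookpoint,
        "--scorers", scorer,
        "--filter_bos",
        "--name", name,
    ]


cmd = build_delphi_command(
    model="EleutherAI/pythia-160m",
    sparse_model="EleutherAI/Pythia-160m-SST-k32-32k",
    n_tokens=10_000_000,
    max_latents=100,
    hookpoint="layers.5.mlp",
    scorer="detection",
    name="llama-3-8B",
)
command_line = shlex.join(cmd)
print(command_line)

# To actually launch the run (requires Delphi to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```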

Highlighted Details

Automated generation and scoring of explanations for millions of SAE/transcoder features.
Support for both local vLLM inference and the remote OpenRouter API for explanation generation.
FAISS integration for constructing semantically relevant "hard negative" examples to improve feature specificity.
ContrastiveExplainer approach that utilizes positive and negative examples to refine explanations.
Multiple scoring mechanisms including detection, recall, fuzzing, surprisal, and embedding-based retrieval.
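As a rough illustration of detection-style scoring, a judge model reads an explanation and predicts, for each example, whether the feature activates; those predictions are then compared with the true activations. The sketch below computes standard classification metrics from such predictions. The function name detection_metrics and the sample data are assumptions for illustration and do not reflect Delphi's own scorer classes or output format.

```python
def detection_metrics(predictions, labels):
    """Compare predicted activations with true activations.

    predictions[i] is True if the judge, given only the explanation, says
    example i activates the feature; labels[i] is True if it really does.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented sample: three activating and three non-activating examples.
labels = [True, True, True, False, False, False]
predictions = [True, True, False, True, False, False]

scores = detection_metrics(predictions, labels)
print(scores)
```

An explanation that is too broad tends to lose precision (it fires on non-activating examples), while one that is too narrow loses recall, which is why hard negatives make the precision side of this test more demanding.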

Maintenance & Community

The codebase is under active development. The article_version branch is maintained so that published experiments can be reproduced. The source material does not mention community channels (e.g., Discord, Slack) or sponsorship details.

Licensing & Compatibility

The project is licensed under the Apache License, Version 2.0. This permissive license allows commercial use and integration into closed-source projects. It imposes no copyleft requirements, although redistributions must retain the license text and attribution notices.

Limitations & Caveats

The main branch is under active development, and its usage may differ from the article_version branch. Users who need to reproduce the published experimental results should use the article_version branch.
