delphi by EleutherAI

Automated interpretability for LLMs

Created 1 year ago
253 stars

Top 99.3% on SourcePulse

Project Summary

Delphi is a library for automated interpretability of large language models. It generates and scores natural-language explanations for sparse autoencoder (SAE) and transcoder features, and is built to handle millions of learned features, giving researchers and engineers a scalable way to inspect what a model has learned internally.

How It Works

The library automates feature interpretation in three stages. First, it caches model activations over a large set of tokens. Second, it generates explanations with a configurable explainer model, run either locally via vLLM or remotely via the OpenRouter API. A ContrastiveExplainer shows the explainer both activating and non-activating examples. To make those non-activating examples informative, FAISS-based semantic similarity search selects "hard negatives": texts that resemble the activating examples but do not trigger the feature, which pushes explanations to be specific to what the feature actually fires on. Third, the generated explanations are scored with metrics such as detection, recall, and fuzzing.
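The hard-negative selection described above can be illustrated with a minimal sketch. It is not Delphi's implementation: it uses a brute-force NumPy cosine-similarity search in place of FAISS, and the function name `hard_negatives`, its parameters, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def hard_negatives(activating_vecs, candidate_vecs, candidate_activates, k=2):
    """Pick the k non-activating candidates most similar to the activating examples.

    Hypothetical helper: brute-force cosine similarity stands in for a FAISS index.
    """
    activating_vecs = np.asarray(activating_vecs, dtype=float)
    candidate_vecs = np.asarray(candidate_vecs, dtype=float)
    candidate_activates = np.asarray(candidate_activates, dtype=bool)

    # Normalise rows so that a dot product equals cosine similarity.
    a = activating_vecs / np.linalg.norm(activating_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)

    # Mean similarity of each candidate to the set of activating examples.
    sims = (c @ a.T).mean(axis=1)

    # Candidates on which the feature fires cannot serve as negatives.
    sims = np.where(candidate_activates, -np.inf, sims)

    # Most similar first, truncated to the number of true negatives available.
    order = np.argsort(-sims)
    n_negatives = int((~candidate_activates).sum())
    return order[: min(k, n_negatives)].tolist()


# Toy 2-D "embeddings": one activating example and four candidates.
activating = [[1.0, 0.0]]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]]
fires = [True, False, False, False]
print(hard_negatives(activating, candidates, fires, k=2))
```

Candidate 0 is the closest match but is excluded because the feature fires on it, so the sketch returns the closest remaining candidates. At real scale an approximate or indexed search such as FAISS replaces the dense matrix product.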

Quick Start & Requirements

Install the package in editable mode from the project directory:

  pip install -e .

A typical workflow caches activations (e.g., 10M tokens from EleutherAI/SmolLM2-135M-10B), generates explanations for the chosen model hookpoints and features, and then scores them. The command-line interface runs the default pipeline in one step:

  python -m delphi EleutherAI/pythia-160m EleutherAI/Pythia-160m-SST-k32-32k --n_tokens 10_000_000 --max_latents 100 --hookpoints layers.5.mlp --scorers detection --filter_bos --name llama-3-8B

Programmatic usage is also supported. To reproduce the experiments from the associated article, use the article_version branch.
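For scripted runs, the command above can be assembled from Python before being handed to a process launcher. The sketch below only builds and prints the argument list; it does not call Delphi. The helper `build_command` is hypothetical, and every flag and value is copied verbatim from the command in the text rather than taken from Delphi's documentation.

```python
import shlex


def build_command(model, sparse_model, **options):
    """Assemble the `python -m delphi` argument list (hypothetical helper)."""
    args = ["python", "-m", "delphi", model, sparse_model]
    for flag, value in options.items():
        if value is True:
            # Boolean switches such as --filter_bos take no value.
            args.append(f"--{flag}")
        elif value is not False and value is not None:
            args.extend([f"--{flag}", str(value)])
    return args


cmd = build_command(
    "EleutherAI/pythia-160m",
    "EleutherAI/Pythia-160m-SST-k32-32k",
    n_tokens="10_000_000",
    max_latents=100,
    hookpoints="layers.5.mlp",
    scorers="detection",
    filter_bos=True,
    name="llama-3-8B",
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the pipeline, assuming Delphi is installed in the active environment.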

Highlighted Details

  • Automated generation and scoring of explanations for millions of SAE/transcoder features.
  • Support for both local vLLM inference and the remote OpenRouter API for explanation generation.
  • FAISS integration for constructing semantically relevant "hard negative" examples to improve feature specificity.
  • ContrastiveExplainer approach that utilizes positive and negative examples to refine explanations.
  • Multiple scoring mechanisms including detection, recall, fuzzing, surprisal, and embedding-based retrieval.
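To make the idea of scoring an explanation concrete, here is a minimal sketch of a detection-style score. It assumes the usual framing in which a judge model reads the explanation and predicts, for each text example, whether the feature activates. The judge's predictions are hard-coded here, and the function `detection_metrics` is a hypothetical helper, not part of Delphi's API.

```python
def detection_metrics(predicted, actual):
    """Compare judge predictions with ground-truth activations.

    predicted[i]: the judge, given only the explanation, says example i activates.
    actual[i]:    the feature really activates on example i.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)      # flagged, but inactive
    fn = sum(1 for p, a in pairs if not p and a)      # missed activation
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly ignored

    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Stand-in judge outputs for five examples (illustrative values only).
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(detection_metrics(predicted, actual))
```

A good explanation lets the judge separate activating from non-activating examples, so all three numbers approach 1. Including hard negatives among the non-activating examples makes a high score harder to reach with an overly broad explanation.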

Maintenance & Community

The codebase is under active development. The article_version branch is maintained so that published experiments remain reproducible. No community channels (e.g., Discord, Slack) or sponsorship details are listed.

Licensing & Compatibility

The project is licensed under the Apache License, Version 2.0. This permissive license allows commercial use and integration into closed-source projects, subject to its attribution and notice requirements, and imposes no copyleft obligations.

Limitations & Caveats

The main branch changes frequently, and its usage may differ from that of the article_version branch. Anyone who needs to reproduce the published experimental results should use article_version rather than main.

Health Check

Last Commit: 22 hours ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 6 stars in the last 30 days
