EleutherAI: Automated interpretability for LLMs
Summary
Delphi provides automated interpretability tools for large language models (LLMs). It generates and scores explanations for sparse autoencoder (SAE) and transcoder features, so that researchers and engineers can analyze millions of learned features and gain insight into model behavior.
How It Works
The library automates the interpretation of learned features within LLMs. It begins by caching model activations over large token sets. Explanations are then generated by a configurable explainer model, run either locally via vLLM or through the OpenRouter API. A ContrastiveExplainer incorporates both activating and non-activating examples. To improve explanation quality, the library uses FAISS for efficient semantic similarity search to build "hard negative" examples, so that an explanation must be specific to the feature's actual activations rather than to semantically similar text. Finally, the generated explanations are scored with metrics such as detection, recall, and fuzzing.
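The hard-negative and scoring steps can be illustrated with a small, self-contained sketch. This is not Delphi's code: it swaps FAISS for a brute-force cosine-similarity ranking in plain Python, and it assumes that detection scoring amounts to comparing a judge's "activates / does not activate" guesses with the ground truth. The function names `hard_negatives` and `detection_accuracy`, the toy embeddings, and the judge output are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_negatives(query, candidates, k):
    """Return indices of the k non-activating candidates most similar to query.

    query: embedding summarizing the activating examples (hypothetical input).
    candidates: embeddings of non-activating examples.
    Brute-force ranking stands in for a FAISS index here.
    """
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


def detection_accuracy(predicted, actual):
    """Fraction of examples where the judge's guess matches the ground truth."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy 2-D embeddings: candidates 0 and 2 point roughly the same way as the query.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
negatives = hard_negatives(query, candidates, k=2)

# Ground truth: two activating examples followed by the two hard negatives.
actual = [True, True, False, False]
predicted = [True, False, False, False]  # hypothetical judge output
score = detection_accuracy(predicted, actual)

print(negatives, score)  # [0, 2] 0.75
```

Choosing negatives that sit close to the activating examples in embedding space makes the judge's task harder, so a vague explanation that matches any text on the same topic scores lower than one that pins down what the feature actually responds to.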
Quick Start & Requirements
Install the package in editable mode from the project directory:

    pip install -e .

A typical workflow caches activations (e.g., over 10M tokens from EleutherAI/SmolLM2-135M-10B), generates explanations for the specified model hookpoints and features, and scores them. The command-line interface runs the default pipeline:

    python -m delphi EleutherAI/pythia-160m EleutherAI/Pythia-160m-SST-k32-32k --n_tokens 10_000_000 --max_latents 100 --hookpoints layers.5.mlp --scorers detection --filter_bos --name llama-3-8B

Programmatic usage is also supported. To reproduce experiments from the associated article, use the article_version branch.
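When the pipeline is launched from a script, for example to sweep over hookpoints, it can help to assemble the command as an argument list rather than a single string. The sketch below only rebuilds the command shown above using the flags that appear in it. The helper `build_delphi_command` is hypothetical and not part of Delphi, and passing several hookpoints or scorers as space-separated values is an assumption.

```python
import shlex


def build_delphi_command(model, sparse_model, *, n_tokens, max_latents,
                         hookpoints, scorers, name, filter_bos=True):
    """Assemble an argv list for `python -m delphi` (hypothetical helper)."""
    argv = ["python", "-m", "delphi", model, sparse_model]
    argv += ["--n_tokens", str(n_tokens)]
    argv += ["--max_latents", str(max_latents)]
    # Assumption: multiple hookpoints or scorers are space-separated values.
    argv += ["--hookpoints", *hookpoints]
    argv += ["--scorers", *scorers]
    if filter_bos:
        argv.append("--filter_bos")
    argv += ["--name", name]  # value passed to --name, as in the example above
    return argv


cmd = build_delphi_command(
    "EleutherAI/pythia-160m",
    "EleutherAI/Pythia-160m-SST-k32-32k",
    n_tokens=10_000_000,
    max_latents=100,
    hookpoints=["layers.5.mlp"],
    scorers=["detection"],
    name="llama-3-8B",
)

# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the standard library. Keeping it as a list avoids shell-quoting problems if a name ever contains spaces.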
Maintenance & Community
The codebase is under active development. The article_version branch is maintained so that the published experiments remain reproducible. No community channels (e.g., Discord, Slack) or sponsorship details are listed.
Licensing & Compatibility
The project is licensed under the Apache License, Version 2.0. This permissive license allows commercial use and integration into closed-source projects and imposes no copyleft requirements, provided that the license text and attribution notices are preserved.
Limitations & Caveats
The main branch is under active development, and its usage may differ from that of the article_version branch. Users who need to reproduce the published experimental results should use article_version.