sae_vis by callummcdougall

Visualizations for sparse autoencoders

Created 2 years ago

262 stars

Top 97.0% on SourcePulse

Project Summary

This repository provides visualization tools for sparse autoencoders (SAEs), helping researchers and engineers inspect what the learned features of these models represent. It offers feature-centric and prompt-centric views that replicate visualizations from Anthropic's published research, supporting interpretability work and diagnosis of SAEs.

How It Works

The library offers two primary visualization modes for dissecting SAE behavior. The feature-centric view lets users inspect individual features by identifying the tokens or sequences in a dataset that activate them most strongly, which gives insight into what each feature "detects." The prompt-centric view instead analyzes a custom prompt and shows which features are most influential for that input according to a chosen metric, such as activation magnitude or impact on token prediction. Together, the two views show both what a feature responds to across a dataset and which features matter for a particular input.
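The two views can be illustrated with a small toy example. This is a minimal sketch of the underlying idea only, not the sae_vis API: the function names, the token list, and the activation values are all hypothetical, and real SAE activations would come from running a model and an autoencoder over a dataset.

```python
from typing import List, Tuple


def top_activating_tokens(
    tokens: List[str], acts: List[List[float]], feature: int, k: int = 3
) -> List[Tuple[str, float]]:
    """Feature-centric view: the k tokens on which `feature` fires most strongly.

    `acts[i][j]` is the (hypothetical) activation of feature j on token i.
    """
    scored = [(tok, row[feature]) for tok, row in zip(tokens, acts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def top_features_for_token(acts_row: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Prompt-centric view: the k features with the largest activation on one token."""
    ranked = sorted(enumerate(acts_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up data: 6 tokens, 3 features.
tokens = ["The", " cat", " sat", " on", " the", " mat"]
acts = [
    [0.0, 0.1, 0.0],
    [2.5, 0.0, 0.3],
    [0.2, 1.1, 0.0],
    [0.0, 0.0, 0.9],
    [0.0, 0.2, 0.0],
    [1.8, 0.0, 0.4],
]

# Which tokens does feature 0 respond to most?
print(top_activating_tokens(tokens, acts, feature=0, k=2))
# Which features are most active on the token " sat"?
print(top_features_for_token(acts[2], k=2))
```

Activation magnitude is used as the ranking metric here because it is the simplest of the metrics mentioned above; ranking by impact on token prediction would need model logits as well.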

Quick Start & Requirements

Install with pip: pip install sae-vis. For development, the project uses Poetry for dependency management; after cloning the repository, run poetry install to set up the environment. No hardware prerequisites such as GPUs are listed, and a standard Python 3 environment is assumed. A demo Colab notebook is available, and its complete code is included in the repository for reproduction and experimentation. The documentation links to the PyPI package page and the original Anthropic visualizations.
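After installing, one way to confirm that the package is visible to the current interpreter is to query its installed version with the standard library. This is a small sketch; the helper name installed_version is hypothetical, and the distribution name "sae-vis" is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "sae-vis") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("sae-vis is not installed in this environment")
else:
    print(f"sae-vis {version} is installed")
```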

Highlighted Details

Directly replicates and extends visualization techniques pioneered in Anthropic's SAE research.
Supports distinct feature-centric and prompt-centric analysis perspectives, offering complementary views of SAE functionality.
Version 0.3.0 introduced a significant refactor, enhancing capabilities with support for OthelloGPT SAEs, linear probes (input/output space), attention output SAEs, and detailed token-level visualizations, including the change in correct-token probability upon feature ablation.
Designed for compatibility and integration with the sae-lens library, a related project.
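The "change in correct-token probability upon feature ablation" mentioned above can be illustrated with a toy calculation. The sketch below is not taken from the library: it assumes the feature contributes linearly to the output logits (activation times a per-token effect vector) and ignores any nonlinearity between the feature and the logits. All names and numbers are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ablation_prob_change(
    logits: List[float],
    feature_act: float,
    feature_logit_effect: List[float],
    correct_idx: int,
) -> float:
    """Change in P(correct token) when the feature's logit contribution is removed.

    Assumes the feature adds `feature_act * feature_logit_effect[i]` to logit i.
    A negative result means the feature was helping the correct prediction.
    """
    ablated = [
        logit - feature_act * effect
        for logit, effect in zip(logits, feature_logit_effect)
    ]
    return softmax(ablated)[correct_idx] - softmax(logits)[correct_idx]


# Made-up example: 3-token vocabulary, the feature boosts token 0 only.
logits = [2.0, 1.0, 0.0]
delta = ablation_prob_change(
    logits, feature_act=1.5, feature_logit_effect=[1.0, 0.0, 0.0], correct_idx=0
)
print(f"Change in correct-token probability after ablation: {delta:+.3f}")
```

In this example the feature pushes up the logit of the correct token, so ablating it lowers that token's probability and the reported change is negative.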

Maintenance & Community

The project is no longer actively maintained by its original author, who has moved to a new role, although the author remains open to community contributions via pull requests. For ongoing development and a broader suite of tools for working with SAEs, the SAELens library is explicitly recommended; it forked this repository and builds on it.

Licensing & Compatibility

The provided README text does not state the open-source license governing this repository. Potential adopters should therefore confirm usage rights before relying on it, particularly for commercial applications, derivative works, or integration into closed-source projects.

Limitations & Caveats

The primary limitation is the lack of active maintenance: future updates, bug fixes, and feature enhancements are not guaranteed. Users are directed to the SAELens library for more current development and a more comprehensive feature set. Users unfamiliar with Poetry may also find the development setup a minor hurdle compared with a standard pip-based workflow.

Explore Similar Projects

Awesome-LLM-Interpretability by cooperleong00

klarity by klara-research

interpret-text by interpretml

Streamline-Analyst by Wilson-ZheLin

Quantus by understandable-machine-intelligence-lab

OmniXAI by salesforce

sketch by approximatelabs

DALEX by ModelOriented

transformer-debugger by openai

SAELens by decoderesearch

lit by PAIR-code

captum by meta-pytorch