circuit-tracer by safety-research

Tool for neural network circuit discovery

Created 3 months ago
2,341 stars

Top 19.5% on SourcePulse

View on GitHub
Project Summary

This library provides tools for finding, visualizing, and intervening on neural network "circuits" using cross-layer MLP transcoders. It's designed for researchers and practitioners in mechanistic interpretability seeking to understand model behavior by tracing feature activations and their causal effects.

How It Works

The library implements a three-step process:

1. Attribution: computes the direct effect of input tokens, transcoder features, and error nodes on other features and on output logits, using MLP transcoders.
2. Graph creation: prunes the attribution graph by an influence threshold and converts it to a JSON format for visualization.
3. Visualization & intervention: hosts a local web server for displaying and interacting with the graph, letting users annotate features and perform interventions by setting transcoder features to specific values.
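The pruning step (step 2) can be illustrated with a minimal, self-contained sketch. This is a conceptual toy, not the circuit-tracer API: the edge representation and the total-absolute-influence rule are illustrative assumptions.

```python
# Conceptual sketch of influence-threshold pruning (NOT the library's API).
# An attribution graph is modeled as a dict mapping (source, target) edges
# to attribution scores; nodes whose total absolute influence falls below
# a threshold are dropped, along with their outgoing edges.

def prune_graph(edges, threshold):
    """edges: dict mapping (source, target) -> attribution score."""
    influence = {}
    for (src, _tgt), score in edges.items():
        influence[src] = influence.get(src, 0.0) + abs(score)
    kept = {node for node, inf in influence.items() if inf >= threshold}
    # Keep only edges whose source survived pruning; pure targets
    # (such as the output logits) are always retained.
    return {edge: s for edge, s in edges.items() if edge[0] in kept}

edges = {
    ("token_0", "feat_A"): 0.9,
    ("feat_A", "logits"): 0.8,
    ("feat_B", "logits"): 0.05,  # weakly influential; pruned at 0.1
}
pruned = prune_graph(edges, threshold=0.1)
print(sorted(pruned))  # the feat_B edge is gone
```

The actual library computes influence through the full attribution graph before pruning; the sketch only shows the thresholding idea.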

Quick Start & Requirements

  • Install via pip install . after cloning the repository.
  • Requires Python and PyTorch.
  • Demos are available as Jupyter notebooks, runnable on Colab (GPU recommended) or locally.
  • Working with Gemma-2 (2B) is possible with ~15GB GPU RAM; larger models or batch sizes require more.
  • Official tutorial: demos/circuit_tracing_tutorial.ipynb
  • CLI usage example: circuit-tracer attribute --prompt "..." --transcoder_set gemma --slug gemma-demo --graph_file_dir ./graph_files --server

Highlighted Details

  • Supports Gemma-2 (2B) and Llama-3.2 (1B) models with provided transcoder sets.
  • Offers a web-based visualization interface for exploring attribution graphs.
  • Enables direct model interventions by manipulating transcoder features.
  • CLI for end-to-end circuit finding, pruning, and visualization.
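Conceptually, an intervention clamps a transcoder feature to a chosen value and re-runs the forward pass to observe the downstream effect. A toy, self-contained sketch, assuming a linear feature-to-logit read-out; the function and variable names are illustrative, not the library's API:

```python
# Toy model: features -> logits via a linear read-out. An "intervention"
# clamps one feature to a fixed value before computing the logits.

def logits_from_features(features, readout, clamp=None):
    f = list(features)
    if clamp is not None:
        idx, value = clamp
        f[idx] = value  # intervention: override the feature activation
    return [sum(w * x for w, x in zip(row, f)) for row in readout]

readout = [[1.0, 2.0], [0.5, -1.0]]  # 2 logits read from 2 features
features = [0.3, 0.7]

base = logits_from_features(features, readout)
ablated = logits_from_features(features, readout, clamp=(1, 0.0))
print(base)     # [1.7, -0.55]
print(ablated)  # [0.3, 0.15] -- zeroing feature 1 shifts both logits
```

Comparing the clamped and unclamped runs shows the feature's causal contribution, which is the kind of comparison the web interface surfaces.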

Maintenance & Community

  • Developed by researchers from safety-research.
  • Citation details provided for academic use.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Interventions are currently only supported when using the library via a script or notebook, not through the Neuronpedia interface.
  • The Llama demo is not supported on Colab.
  • Full support for custom transcoder configurations is noted as "coming soon."
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 71 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

automated-interpretability by openai (Top 0.1%, 1k stars)

Code and datasets for automated interpretability research
Created 2 years ago · Updated 1 year ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai (Top 0.1%, 4k stars)

Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Neel Nanda (Research Scientist at Google DeepMind), and 1 more.

TransformerLens by TransformerLensOrg (Top 1.0%, 3k stars)

Library for mechanistic interpretability research on GPT-style language models
Created 3 years ago · Updated 1 day ago