circuit-tracer by safety-research

Tool for neural network circuit discovery

Created 3 months ago
2,341 stars

Top 19.5% on SourcePulse

View on GitHub
Project Summary

This library provides tools for finding, visualizing, and intervening on neural network "circuits" using cross-layer MLP transcoders. It's designed for researchers and practitioners in mechanistic interpretability seeking to understand model behavior by tracing feature activations and their causal effects.

How It Works

The library implements a three-step process:

1. Attribution: computes the direct effect of input tokens, transcoder features, and error nodes on other features and on output logits, using MLP transcoders.
2. Graph creation: prunes the attribution graph by an influence threshold and converts it to a JSON format for visualization.
3. Visualization & intervention: hosts a local web server for displaying and interacting with the graph, letting users annotate features and perform interventions by setting transcoder features to specific values.
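The pruning step (step 2) can be illustrated with a minimal, self-contained sketch. This is a conceptual toy, not the circuit-tracer API: the edge representation and the total-absolute-influence rule are illustrative assumptions.

```python
# Conceptual sketch of influence-threshold pruning (NOT the library's API).
# An attribution graph is modeled as a dict mapping (source, target) edges
# to attribution scores; nodes whose total absolute influence falls below
# a threshold are dropped, along with their outgoing edges.

def prune_graph(edges, threshold):
    """edges: dict mapping (source, target) -> attribution score."""
    influence = {}
    for (src, _tgt), score in edges.items():
        influence[src] = influence.get(src, 0.0) + abs(score)
    kept = {node for node, inf in influence.items() if inf >= threshold}
    # Keep only edges whose source survived pruning; pure targets
    # (such as the output logits) are always retained.
    return {edge: s for edge, s in edges.items() if edge[0] in kept}

edges = {
    ("token_0", "feat_A"): 0.9,
    ("feat_A", "logits"): 0.8,
    ("feat_B", "logits"): 0.05,  # weakly influential; pruned at 0.1
}
pruned = prune_graph(edges, threshold=0.1)
print(sorted(pruned))  # the feat_B edge is gone
```

The actual library computes influence through the full attribution graph before pruning; the sketch only shows the thresholding idea.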

Quick Start & Requirements

  • Install via pip install . after cloning the repository.
  • Requires Python and PyTorch.
  • Demos are available as Jupyter notebooks, runnable on Colab (GPU recommended) or locally.
  • Working with Gemma-2 (2B) is possible with ~15GB GPU RAM; larger models or batch sizes require more.
  • Official tutorial: demos/circuit_tracing_tutorial.ipynb
  • CLI usage example: circuit-tracer attribute --prompt "..." --transcoder_set gemma --slug gemma-demo --graph_file_dir ./graph_files --server

Highlighted Details

  • Supports Gemma-2 (2B) and Llama-3.2 (1B) models with provided transcoder sets.
  • Offers a web-based visualization interface for exploring attribution graphs.
  • Enables direct model interventions by manipulating transcoder features.
  • CLI for end-to-end circuit finding, pruning, and visualization.
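Conceptually, an intervention clamps a transcoder feature to a chosen value and re-runs the forward pass to observe the downstream effect. A toy, self-contained sketch, assuming a linear feature-to-logit read-out; the function and variable names are illustrative, not the library's API:

```python
# Toy model: features -> logits via a linear read-out. An "intervention"
# clamps one feature to a fixed value before computing the logits.

def logits_from_features(features, readout, clamp=None):
    f = list(features)
    if clamp is not None:
        idx, value = clamp
        f[idx] = value  # intervention: override the feature activation
    return [sum(w * x for w, x in zip(row, f)) for row in readout]

readout = [[1.0, 2.0], [0.5, -1.0]]  # 2 logits read from 2 features
features = [0.3, 0.7]

base = logits_from_features(features, readout)
ablated = logits_from_features(features, readout, clamp=(1, 0.0))
print(base)     # [1.7, -0.55]
print(ablated)  # [0.3, 0.15] -- zeroing feature 1 shifts both logits
```

Comparing the clamped and unclamped runs shows the feature's causal contribution, which is the kind of comparison the web interface surfaces.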

Maintenance & Community

  • Developed by researchers from safety-research.
  • Citation details provided for academic use.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Interventions are currently only supported when using the library via a script or notebook, not through the Neuronpedia interface.
  • The Llama demo is not supported on Colab.
  • Full support for custom transcoder configurations is noted as "coming soon."
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 5
  • Issues (30d): 4

Star History

  • 71 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

automated-interpretability by openai (Top 0.1%, 1k stars)

Code and datasets for automated interpretability research
Created 2 years ago · Updated 1 year ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai (Top 0.1%, 4k stars)

Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Neel Nanda (Research Scientist at Google DeepMind), and 1 more.

TransformerLens by TransformerLensOrg (Top 1.0%, 3k stars)

Library for mechanistic interpretability research on GPT-style language models
Created 3 years ago · Updated 1 day ago