circuit-tracer  by safety-research

Tool for neural network circuit discovery

created 2 months ago
2,210 stars

Top 20.9% on sourcepulse

Project Summary

This library provides tools for finding, visualizing, and intervening on neural network "circuits" using cross-layer MLP transcoders. It's designed for researchers and practitioners in mechanistic interpretability seeking to understand model behavior by tracing feature activations and their causal effects.

How It Works

The library implements a three-step process:

  1. Attribution: Computes the direct effect of input tokens, transcoder features, and error nodes on other features and output logits using MLP transcoders.
  2. Graph Creation: Prunes the attribution graph based on influence thresholds and converts it to a JSON format for visualization.
  3. Visualization & Intervention: Hosts a local web server to display and interact with the graph, allowing users to annotate features and perform interventions by setting transcoder features to specific values.
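
To make the graph-creation step concrete, here is a minimal sketch of threshold-based pruning followed by JSON export. It is not the library's implementation: the function names (`prune_edges`, `to_graph_json`), the edge format, the node labels, and the JSON layout are all assumptions made for illustration, and the pruning rule (keep the strongest edges until they cover a fixed share of total absolute weight) is a simplified stand-in for whatever influence measure the library actually uses.

```python
import json

def prune_edges(edges, threshold=0.8):
    """Keep the strongest edges until they cover `threshold` of total |weight|.

    `edges` is a list of (source, target, weight) tuples (hypothetical format).
    """
    total = sum(abs(w) for _, _, w in edges)
    if total == 0:
        return []
    kept, covered = [], 0.0
    # Visit edges from largest to smallest absolute weight.
    for src, dst, w in sorted(edges, key=lambda e: abs(e[2]), reverse=True):
        if covered >= threshold * total:
            break
        kept.append((src, dst, w))
        covered += abs(w)
    return kept

def to_graph_json(edges):
    """Serialize edges to a simple node/link JSON document (hypothetical layout)."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    links = [{"source": s, "target": t, "weight": w} for s, t, w in edges]
    return json.dumps({"nodes": nodes, "links": links}, indent=2)

# Toy attribution edges: token, feature, and error nodes feeding a logit.
toy_edges = [
    ("token:Dallas", "feature:texas", 0.6),
    ("feature:texas", "logit:Austin", 0.3),
    ("error:layer3", "logit:Austin", 0.05),
    ("token:the", "feature:texas", 0.05),
]

pruned = prune_edges(toy_edges, threshold=0.8)
print(to_graph_json(pruned))
```

With the toy data, the two weak edges are dropped and only the token-to-feature and feature-to-logit edges survive.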

Quick Start & Requirements

  • Install via pip install . after cloning the repository.
  • Requires Python and PyTorch.
  • Demos are available as Jupyter notebooks, runnable on Colab (GPU recommended) or locally.
  • Working with Gemma-2 (2B) is possible with ~15GB GPU RAM; larger models or batch sizes require more.
  • Official tutorial: demos/circuit_tracing_tutorial.ipynb
  • CLI usage example: circuit-tracer attribute --prompt "..." --transcoder_set gemma --slug gemma-demo --graph_file_dir ./graph_files --server
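
For scripted runs, the CLI call above can be assembled from Python. The sketch below only builds and prints the command; the flags are the ones shown in the usage example, while the helper name `build_attribute_command`, its defaults, and the sample prompt are assumptions for illustration.

```python
import shlex

def build_attribute_command(prompt, transcoder_set="gemma", slug="gemma-demo",
                            graph_file_dir="./graph_files", serve=True):
    """Assemble the `circuit-tracer attribute` argument list (hypothetical helper)."""
    cmd = [
        "circuit-tracer", "attribute",
        "--prompt", prompt,
        "--transcoder_set", transcoder_set,
        "--slug", slug,
        "--graph_file_dir", graph_file_dir,
    ]
    if serve:
        # Start the local visualization server after attribution.
        cmd.append("--server")
    return cmd

# The prompt here is a placeholder, not taken from the project.
cmd = build_attribute_command("The capital of France is")
print(shlex.join(cmd))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.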

Highlighted Details

  • Supports Gemma-2 (2B) and Llama-3.2 (1B) models with provided transcoder sets.
  • Offers a web-based visualization interface for exploring attribution graphs.
  • Enables direct model interventions by manipulating transcoder features.
  • CLI for end-to-end circuit finding, pruning, and visualization.
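
The idea behind feature interventions can be illustrated with a toy model in which an MLP output is reconstructed as a weighted sum of decoder vectors plus an error term. This is a conceptual sketch only, not the library's API: `reconstruct`, `intervene`, the feature names, and all numbers are hypothetical.

```python
def reconstruct(activations, decoder, error):
    """Rebuild an output vector as sum(activation * decoder vector) + error."""
    out = list(error)
    for name, act in activations.items():
        for i in range(len(out)):
            out[i] += act * decoder[name][i]
    return out

def intervene(activations, overrides):
    """Return a copy of the activations with selected features set to fixed values."""
    patched = dict(activations)
    patched.update(overrides)
    return patched

# Toy transcoder: two features, a two-dimensional output, and a residual error.
activations = {"f_texas": 2.0, "f_capital": 1.0}
decoder = {"f_texas": [1.0, 0.0], "f_capital": [0.0, 1.0]}
error = [0.5, 0.5]

baseline = reconstruct(activations, decoder, error)
ablated = reconstruct(intervene(activations, {"f_texas": 0.0}), decoder, error)

print(baseline)  # [2.5, 1.5]
print(ablated)   # [0.5, 1.5]
```

Comparing the baseline and patched outputs shows the causal effect of one feature: here, zeroing `f_texas` changes only the first output dimension.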

Maintenance & Community

  • Developed by researchers from safety-research.
  • Citation details provided for academic use.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Interventions are currently only supported when using the library via a script or notebook, not through the Neuronpedia interface.
  • The Llama demo is not supported on Colab.
  • Full support for custom transcoder configurations is noted as "coming soon."

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 5
  • Issues (30d): 3
  • Star history: 2,222 stars in the last 90 days