Code and datasets for automated interpretability research
This repository provides code and tools for automatically interpreting neurons in language models, aimed at researchers and practitioners studying model behavior. It supports generating natural-language explanations of individual neurons, simulating neuron activations from those explanations, and scoring how well the simulations match the neuron's real behavior.
How It Works
The project implements a three-step methodology: generate an explanation of a neuron's behavior, simulate the neuron's activations across input tokens using only that explanation, and score the explanation by how closely the simulated activations track the real ones. It draws on pre-computed datasets of GPT-2 XL and GPT-2 Small neuron activations and explanations hosted on Azure Blob Storage. Supporting tools analyze connection weights and activation patterns to surface influential tokens and related neurons, giving a structured way to probe the function of specific model components.
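The scoring idea is simple to express in code: an explanation is good to the extent that activations simulated from it correlate with the neuron's real activations. The sketch below shows correlation scoring in isolation, assuming activations are already available as arrays; the repository's actual scorer aggregates over many text sequences, and the function name here is illustrative rather than the library's API.

```python
import numpy as np

def correlation_score(real_activations, simulated_activations) -> float:
    """Pearson correlation between a neuron's real activations and the
    activations a simulator predicted from the explanation alone.
    1.0 = the explanation predicts the neuron perfectly on this text;
    ~0.0 = no relationship."""
    real = np.asarray(real_activations, dtype=np.float64)
    simulated = np.asarray(simulated_activations, dtype=np.float64)
    return float(np.corrcoef(real, simulated)[0, 1])

# Toy example: the simulation roughly tracks the real neuron, so it scores high.
real = [0.0, 0.2, 0.0, 3.1, 0.1, 2.8]
simulated = [0.1, 0.0, 0.0, 2.5, 0.3, 3.0]
print(f"explanation score: {correlation_score(real, simulated):.3f}")
```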
Quick Start & Requirements
The repository is organized into two components, each with its own setup instructions: neuron-explainer (the Python library for generating, simulating, and scoring explanations) and neuron-viewer (a web app for browsing neurons).
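Because the datasets are public, individual records can be fetched over plain HTTPS without Azure tooling. The snippet below is a minimal sketch; the base URL and path scheme are assumptions inferred from the dataset's documented Azure Blob Storage location, so verify them against the repository README before relying on them.

```python
import json
import urllib.request

# Assumed base URL and path scheme for the public dataset; confirm in the README.
AZURE_BASE = "https://openaipublic.blob.core.windows.net/neuron-explainer/data"

def fetch_explanation_records(layer: int, neuron: int) -> list:
    """Download the JSONL explanation records for one GPT-2 XL neuron."""
    url = f"{AZURE_BASE}/explanations/{layer}/{neuron}.jsonl"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]

records = fetch_explanation_records(layer=0, neuron=0)
print(len(records), "record(s); top-level keys:", sorted(records[0]))
```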
Highlighted Details
Pre-computed neuron activation and explanation datasets for GPT-2 XL and GPT-2 Small, served from public Azure Blob Storage.
An interactive viewer for inspecting individual neurons together with their explanations and activation records (neuron-viewer).
Maintenance & Community
The project is associated with OpenAI. The provided README does not describe community channels or ongoing maintenance activity.
Licensing & Compatibility
The licensing information is not explicitly stated in the provided README.
Limitations & Caveats
A bug was discovered in the GELU implementation used for GPT-2-series inference, producing small discrepancies in post-MLP activation values relative to the original implementation. The methodology used for GPT-2 Small also differs from that used for GPT-2 XL, so results for the two models are not directly comparable.
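The README excerpt does not pin down the exact nature of the GELU bug, but the two GELU variants in common use show how such small discrepancies arise: GPT-2 was trained with the tanh approximation, and swapping in the exact erf-based form shifts post-MLP activations slightly. The comparison below is illustrative context, not the repository's inference code.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation used by the original GPT-2 implementation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
          f"tanh={gelu_tanh(x):+.6f}  diff={gelu_exact(x) - gelu_tanh(x):+.2e}")
```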