TransformerLens by TransformerLensOrg

Library for mechanistic interpretability research on GPT-style language models

Created 3 years ago
2,593 stars

Top 18.1% on SourcePulse

Project Summary

TransformerLens is a Python library designed for mechanistic interpretability of GPT-style language models. It empowers researchers and practitioners to reverse-engineer the internal algorithms learned by these models by providing access to and manipulation of intermediate activations. The library facilitates in-depth analysis of model behavior, enabling a deeper understanding of how LLMs function.

How It Works

TransformerLens operates by allowing users to load various pre-trained transformer models and attach "hooks" to specific layers or components. These hooks can cache, modify, or replace activations as the model processes input. This fine-grained control over internal states enables techniques such as activation patching and direct logit attribution, which are crucial for dissecting model computations and identifying the mechanisms responsible for specific behaviors.
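
As a concrete illustration, here is a minimal activation-patching sketch: the clean run's residual stream is patched into a corrupted run at the final token position. The prompts, layer, hook point, and target token are illustrative choices, while the calls shown (HookedTransformer.from_pretrained, run_with_cache, run_with_hooks, to_single_token) are the library's documented entry points.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    clean_tokens = model.to_tokens("The Eiffel Tower is in the city of")
    corrupt_tokens = model.to_tokens("The Colosseum is in the city of")

    # Cache every activation from the clean run
    _, clean_cache = model.run_with_cache(clean_tokens)

    hook_name = "blocks.6.hook_resid_pre"  # residual stream entering layer 6

    def patch_resid(resid, hook):
        # Overwrite the corrupted run's residual stream at the last
        # position with the clean run's activation at that position
        resid[:, -1, :] = clean_cache[hook_name][:, -1, :]
        return resid

    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, patch_resid)],
    )

    # Inspect how much the patch restores the clean answer's logit
    paris = model.to_single_token(" Paris")
    print(patched_logits[0, -1, paris])

Sweeping the layer and position of the patch and measuring how much each patch restores the clean behavior is the standard way this technique localizes which components carry the relevant information.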

Quick Start & Requirements

  • Primary install: pip install transformer_lens (a minimal usage sketch follows this list)
  • Requirements: Python, PyTorch. Supports loading over 50 open-source language models.
  • Resources: Can be run on a single GPU or even CPU for smaller models, with many tutorials available in Colab notebooks.
  • Links: Introduction to the Library, Demos
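
For orientation, a minimal quick-start sketch, where "gpt2" stands in for any of the 50+ supported model names and the prompt is arbitrary; from_pretrained and run_with_cache are the library's documented entry points.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    logits, cache = model.run_with_cache("Hello, interpretability!")

    # The returned cache maps hook names to activation tensors,
    # e.g. layer-0 attention patterns of shape [batch, n_heads, pos, pos]
    print(cache["blocks.0.attn.hook_pattern"].shape)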

Highlighted Details

  • Facilitates mechanistic interpretability research, with several papers published using the library.
  • Supports caching and editing of internal model activations via a hook system.
  • Offers a wide range of tutorials and examples for learning interpretability techniques.
  • Integrates with various open-source LLMs, including GPT-2 variants.

Maintenance & Community

  • Created by Neel Nanda, maintained by Bryce Meyer.
  • Active community on Slack for discussions and contributions (Slack Community).

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Primarily focused on GPT-style architectures; support for other architectures may be limited.
  • The field of mechanistic interpretability is nascent, with ongoing development and potential for breaking changes.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 36
  • Issues (30d): 4
  • Star History: 115 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

automated-interpretability by openai

Top 0.1% on SourcePulse · 1k stars
Code and datasets for automated interpretability research
Created 2 years ago · Updated 1 year ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

Top 0.1% on SourcePulse · 4k stars
Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago