TransformerLens by TransformerLensOrg

Library for mechanistic interpretability research on GPT-style language models

Created 3 years ago
2,593 stars

Top 18.1% on SourcePulse

Project Summary

TransformerLens is a Python library designed for mechanistic interpretability of GPT-style language models. It empowers researchers and practitioners to reverse-engineer the internal algorithms learned by these models by providing access to and manipulation of intermediate activations. The library facilitates in-depth analysis of model behavior, enabling a deeper understanding of how LLMs function.

How It Works

TransformerLens operates by allowing users to load various pre-trained transformer models and attach "hooks" to specific layers or components. These hooks can cache, modify, or replace activations as the model processes input. This fine-grained control over internal states enables techniques such as activation patching and direct logit attribution, which are crucial for dissecting model computations and identifying the mechanisms responsible for specific behaviors.
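
As a concrete illustration, here is a minimal activation-patching sketch: the clean run's residual stream is patched into a corrupted run at the final token position. The prompts, layer, hook point, and target token are illustrative choices, while the calls shown (HookedTransformer.from_pretrained, run_with_cache, run_with_hooks, to_single_token) are the library's documented entry points.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    clean_tokens = model.to_tokens("The Eiffel Tower is in the city of")
    corrupt_tokens = model.to_tokens("The Colosseum is in the city of")

    # Cache every activation from the clean run
    _, clean_cache = model.run_with_cache(clean_tokens)

    hook_name = "blocks.6.hook_resid_pre"  # residual stream entering layer 6

    def patch_resid(resid, hook):
        # Overwrite the corrupted run's residual stream at the last
        # position with the clean run's activation at that position
        resid[:, -1, :] = clean_cache[hook_name][:, -1, :]
        return resid

    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, patch_resid)],
    )

    # Inspect how much the patch restores the clean answer's logit
    paris = model.to_single_token(" Paris")
    print(patched_logits[0, -1, paris])

Sweeping the layer and position of the patch and measuring how much each patch restores the clean behavior is the standard way this technique localizes which components carry the relevant information.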

Quick Start & Requirements

  • Primary install: pip install transformer_lens (a minimal usage sketch follows this list)
  • Requirements: Python, PyTorch. Supports loading over 50 open-source language models.
  • Resources: Can be run on a single GPU or even CPU for smaller models, with many tutorials available in Colab notebooks.
  • Links: Introduction to the Library, Demos
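
For orientation, a minimal quick-start sketch, where "gpt2" stands in for any of the 50+ supported model names and the prompt is arbitrary; from_pretrained and run_with_cache are the library's documented entry points.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    logits, cache = model.run_with_cache("Hello, interpretability!")

    # The returned cache maps hook names to activation tensors,
    # e.g. layer-0 attention patterns of shape [batch, n_heads, pos, pos]
    print(cache["blocks.0.attn.hook_pattern"].shape)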

Highlighted Details

  • Facilitates mechanistic interpretability research, with several papers published using the library.
  • Supports caching and editing of internal model activations via a hook system.
  • Offers a wide range of tutorials and examples for learning interpretability techniques.
  • Integrates with various open-source LLMs, including GPT-2 variants.

Maintenance & Community

  • Created by Neel Nanda, maintained by Bryce Meyer.
  • Active community on Slack for discussions and contributions (Slack Community).

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Primarily focused on GPT-style architectures; support for other architectures may be limited.
  • The field of mechanistic interpretability is nascent, with ongoing development and potential for breaking changes.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 36
  • Issues (30d): 4
  • Star History: 115 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

automated-interpretability by openai

Top 0.1% on SourcePulse · 1k stars
Code and datasets for automated interpretability research
Created 2 years ago · Updated 1 year ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

Top 0.1% on SourcePulse · 4k stars
Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago