SAELens by decoderesearch

Mechanistic interpretability tools for language models

Created 2 years ago
1,150 stars

Top 33.4% on SourcePulse

View on GitHub
Project Summary

SAELens is a Python library for researchers in mechanistic interpretability and AI safety. It supports training and analysis of Sparse Autoencoders (SAEs) on language model activations, aiming to generate insights for safer, more aligned AI systems. The library works with a range of PyTorch-based models.

How It Works
SAELens trains and analyzes SAEs on activations from PyTorch models. It integrates deeply with TransformerLens via HookedSAETransformer, but also supports Hugging Face Transformers, NNsight, and other frameworks: activations are extracted from the host model and passed through the SAE's encode() and decode() methods, so the same SAE research workflow carries over across environments.
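The extract-activations, encode, decode workflow described above can be illustrated with a minimal framework-agnostic sketch. This is not SAELens's actual API; all names, shapes, and the ReLU encoder form are illustrative assumptions (NumPy is used so the sketch stays self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: model activation width and SAE dictionary size.
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(acts):
    # ReLU(acts @ W_enc + b_enc): sparse, non-negative feature activations.
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def decode(feats):
    # Linear reconstruction of the original activations from features.
    return feats @ W_dec + b_dec

# Stand-in for activations extracted from a hooked language model.
acts = rng.normal(size=(8, d_model))
feats = encode(acts)
recon = decode(feats)
assert feats.min() >= 0.0 and recon.shape == acts.shape
```

In practice the activations come from hooks on the host model (e.g. a TransformerLens residual-stream hook), and the SAE's learned dictionary directions in W_dec are what researchers interpret as features.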

Quick Start & Requirements
Requires a PyTorch environment. The README does not list explicit installation commands; users are directed to the documentation for downloading pre-trained SAEs, training custom ones, and building feature dashboards with the SAE-Vis library. Tutorials cover loading and analyzing pre-trained SAEs, understanding features with the Logit Lens, and training SAEs on synthetic data.

Highlighted Details

  • Facilitates training and analysis of Sparse Autoencoders for mechanistic interpretability.
  • Integrates deeply with TransformerLens and supports Hugging Face Transformers, NNsight, and other PyTorch frameworks.
  • Provides access to pre-trained SAEs and visualization tools via the SAE-Vis library.
  • Driven by AI safety research goals.

Maintenance & Community
Maintained by Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Support is available via the Open Source Mechanistic Interpretability Slack. Related projects include dictionary-learning, Sparsify, Overcomplete, SAE-Vis, and SAEBench.

Licensing & Compatibility
The README does not state an open-source license, so compatibility with commercial use or closed-source linking is undetermined.

Limitations & Caveats
The v6 release was a major refactor of the training code structure. Users migrating from earlier versions should consult the migration guide, as breaking changes and some relearning are likely.

Health Check
Last Commit: 23 hours ago
Responsiveness: Inactive
Pull Requests (30d): 27
Issues (30d): 1
Star History: 50 stars in the last 30 days

Explore Similar Projects

Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

0% · 4k
Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Gabriel Almeida (Cofounder of Langflow), and 5 more.

lit by PAIR-code

0.2% · 4k
Interactive ML model analysis tool for understanding model behavior
Created 5 years ago · Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Neel Nanda (Research Scientist at Google DeepMind), and 1 more.

TransformerLens by TransformerLensOrg

0.8% · 3k
Library for mechanistic interpretability research on GPT-style language models
Created 3 years ago · Updated 3 days ago