SAELens by decoderesearch

Mechanistic interpretability tools for language models

Created 2 years ago
1,150 stars

Top 33.4% on SourcePulse

View on GitHub
Project Summary

SAELens is a Python library for researchers in mechanistic interpretability and AI safety. It supports training and analysis of Sparse Autoencoders (SAEs) on language model activations, aiming to generate insights for safer, more aligned AI systems. The library works with a range of PyTorch-based models.

How It Works
SAELens trains and analyzes SAEs on activations from PyTorch models. It integrates deeply with TransformerLens via HookedSAETransformer, but also supports Hugging Face Transformers, NNsight, and other frameworks: activations are extracted from the host model and passed through the SAE's encode() and decode() methods, so the same SAE research workflow carries over across environments.
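The extract-activations, encode, decode workflow described above can be illustrated with a minimal framework-agnostic sketch. This is not SAELens's actual API; all names, shapes, and the ReLU encoder form are illustrative assumptions (NumPy is used so the sketch stays self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: model activation width and SAE dictionary size.
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(acts):
    # ReLU(acts @ W_enc + b_enc): sparse, non-negative feature activations.
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def decode(feats):
    # Linear reconstruction of the original activations from features.
    return feats @ W_dec + b_dec

# Stand-in for activations extracted from a hooked language model.
acts = rng.normal(size=(8, d_model))
feats = encode(acts)
recon = decode(feats)
assert feats.min() >= 0.0 and recon.shape == acts.shape
```

In practice the activations come from hooks on the host model (e.g. a TransformerLens residual-stream hook), and the SAE's learned dictionary directions in W_dec are what researchers interpret as features.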

Quick Start & Requirements
Requires a PyTorch environment. The README does not list explicit installation commands; users are directed to the documentation for downloading pre-trained SAEs, training custom ones, and building feature dashboards with the SAE-Vis library. Tutorials cover loading and analyzing pre-trained SAEs, understanding features with the Logit Lens, and training SAEs on synthetic data.

Highlighted Details

  • Facilitates training and analysis of Sparse Autoencoders for mechanistic interpretability.
  • Integrates deeply with TransformerLens and supports Hugging Face Transformers, NNsight, and other PyTorch frameworks.
  • Provides access to pre-trained SAEs and visualization tools via the SAE-Vis library.
  • Driven by AI safety research goals.

Maintenance & Community
Maintained by Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Support is available via the Open Source Mechanistic Interpretability Slack. Related projects include dictionary-learning, Sparsify, Overcomplete, SAE-Vis, and SAEBench.

Licensing & Compatibility
The README does not state an open-source license, so compatibility with commercial use or closed-source linking is undetermined.

Limitations & Caveats
The v6 release was a major refactor of the training code structure. Users migrating from earlier versions should consult the migration guide, as breaking changes and some relearning are likely.

Health Check
Last Commit: 23 hours ago
Responsiveness: Inactive
Pull Requests (30d): 27
Issues (30d): 1
Star History: 50 stars in the last 30 days

Explore Similar Projects

Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

0% · 4k
Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Gabriel Almeida (Cofounder of Langflow), and 5 more.

lit by PAIR-code

0.2% · 4k
Interactive ML model analysis tool for understanding model behavior
Created 5 years ago · Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Neel Nanda (Research Scientist at Google DeepMind), and 1 more.

TransformerLens by TransformerLensOrg

0.8% · 3k
Library for mechanistic interpretability research on GPT-style language models
Created 3 years ago · Updated 3 days ago