decoderesearch: Mechanistic interpretability tools for language models
Summary
SAELens is a Python library for researchers in mechanistic interpretability and AI safety. It enables training and analysis of Sparse Autoencoders (SAEs) on language model activations, with the aim of generating insights for safer, more aligned AI systems. The library supports a range of PyTorch-based models.
How It Works
SAELens integrates with PyTorch models for SAE training and analysis. It offers deep integration with TransformerLens via HookedSAETransformer, but also supports Hugging Face Transformers, NNsight, and other frameworks: activations are extracted from the host model and passed through the SAE's encode() and decode() methods, so SAE research can be carried out in whichever environment produced the activations.
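As a minimal sketch of that framework-agnostic path: the snippet below captures residual-stream activations from a plain Hugging Face GPT-2 with a forward pre-hook and passes them through the SAE's encode()/decode() methods. It assumes the pre-v6 SAE.from_pretrained signature, which returned a (sae, cfg_dict, sparsity) tuple; the release and hook-point IDs are illustrative examples, so check the documentation for your installed version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sae_lens import SAE

# Example release/hook-point IDs; the return signature of from_pretrained
# has varied across SAELens versions, so verify against the docs.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

captured = {}

def grab(_module, inputs):
    # The pre-hook sees the block's positional inputs; hidden states come first.
    captured["acts"] = inputs[0]

# Capture the residual stream entering block 8, matching hook_resid_pre.
handle = model.transformer.h[8].register_forward_pre_hook(grab)
with torch.no_grad():
    model(**tok("The quick brown fox", return_tensors="pt"))
handle.remove()

feature_acts = sae.encode(captured["acts"])  # sparse feature activations
recon = sae.decode(feature_acts)             # reconstructed activations
print(feature_acts.shape, recon.shape)
```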
Quick Start & Requirements
Requires a PyTorch environment. The README does not list installation commands; users are directed to the documentation for downloading pre-trained SAEs, training custom ones, and using the SAE-Vis library for feature dashboards. Tutorials cover loading and analyzing pre-trained SAEs, understanding features with the Logit Lens, and training SAEs on synthetic data.
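For orientation, here is a hedged quick-start sketch of the TransformerLens-integrated route. It assumes the package is installed from PyPI as sae-lens, the same tuple-returning from_pretrained as above, and that the loaded config exposes its hook point as sae.cfg.hook_name; all three should be verified against the current docs.

```python
# pip install sae-lens   (assumed PyPI package name; verify against the docs)
import torch
from sae_lens import SAE, HookedSAETransformer

model = HookedSAETransformer.from_pretrained("gpt2")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",       # example pre-trained release
    sae_id="blocks.8.hook_resid_pre",  # example hook point
)

with torch.no_grad():
    # run_with_cache returns logits plus a cache of intermediate activations.
    _, cache = model.run_with_cache("Mechanistic interpretability studies")
    acts = cache[sae.cfg.hook_name]    # activations at the SAE's hook point
    feature_acts = sae.encode(acts)

# Inspect the most active SAE features at the final token position.
top = torch.topk(feature_acts[0, -1], k=5)
print(top.indices.tolist(), top.values.tolist())
```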
Maintenance & Community
Maintained by Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Support is available via the Open Source Mechanistic Interpretability Slack. Related projects include dictionary-learning, Sparsify, Overcomplete, SAE-Vis, and SAEBench.
Licensing & Compatibility
The specific open-source license is not stated in the README, so compatibility with commercial use or closed-source linking is undetermined.
Limitations & Caveats
The recent v6 update was a major refactor of the training code's structure. Users migrating from earlier versions should consult the migration guide, as the change implies potential breaking changes and a learning curve.