natural_language_autoencoders  by kitft

Explain LLM activations with natural language autoencoders

Created 3 weeks ago

703 stars

Top 48.2% on SourcePulse

Project Summary

This repository provides Natural Language Autoencoders (NLA), an open-source library for generating unsupervised explanations of LLM activations. By mapping activation vectors to natural language and back, NLA offers researchers and engineers a tool for understanding internal LLM mechanisms and the semantic content captured by model activations.

How It Works

An NLA has two parts: an Activation Verbalizer (AV) that maps vectors to text, and an Activation Reconstructor (AR) that maps text back to vectors. The AV injects the activation vector into a prompt as a token embedding and autoregressively generates a description. The AR uses a truncated LM to recover the vector from that text. Both vectors are L2-normalized, so the round-trip Mean Squared Error (MSE) between the original and reconstructed activation depends only on how well their directions agree, and it serves as the measure of explanation quality.
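To make the round-trip metric concrete, here is a minimal sketch of how an MSE over L2-normalized vectors reduces to directional agreement. It is an illustration, not the repository's code: the function names are hypothetical, the vectors are random stand-ins for real activations, and averaging the squared error over dimensions (rather than summing it) is an assumption. For unit vectors a and b of dimension d, the squared distance is 2 - 2·cos(a, b), so the MSE equals (2 - 2·cos(a, b)) / d.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(u), l2_normalize(v)
    return sum(x * y for x, y in zip(a, b))


def round_trip_mse(original, reconstructed):
    """MSE between the L2-normalized original and reconstructed vectors.

    Hypothetical helper: averaging over dimensions is an assumption here.
    """
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same dimension")
    a, b = l2_normalize(original), l2_normalize(reconstructed)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Random stand-ins for an activation and its (noisy) reconstruction.
rng = random.Random(0)
d = 64
activation = [rng.gauss(0.0, 1.0) for _ in range(d)]
reconstruction = [x + 0.3 * rng.gauss(0.0, 1.0) for x in activation]

mse = round_trip_mse(activation, reconstruction)
cos = cosine_similarity(activation, reconstruction)

# For unit vectors, MSE and cosine similarity carry the same information.
identity_gap = abs(mse - (2.0 - 2.0 * cos) / d)
print(f"mse={mse:.6f}  cos={cos:.6f}  identity_gap={identity_gap:.2e}")
```

Because of the normalization, rescaling the reconstruction leaves the error unchanged, while a reconstruction pointing in the opposite direction gives the maximum value of 4 / d under this convention.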

Quick Start & Requirements

For inference, install torch, transformers, safetensors, httpx, orjson, pyyaml, numpy, and "sglang[all]>=0.5.6". Launch the SGLang server in the background:

    python -m sglang.launch_server --model-path <model_path> --port 30000 --disable-radix-cache &

Then run inference against it:

    python nla_inference.py <model_path> --sglang-url http://localhost:30000 --parquet path/to/activations.parquet

Training requires substantial GPU resources (e.g., multiple H100s) and involves data generation, SFT, and RL stages detailed in configs/TRAINING_NOTES.md. Inference docs are in docs/inference.md.

Highlighted Details

  • Released checkpoints cover Qwen2.5-7B, Gemma-3-12B/27B, and Llama-3.3-70B, extracting activations from mid-to-deep model layers.
  • Training leverages near-frontier infrastructure: Miles for Ray-orchestrated RL (FSDP2/Megatron, GRPO) and SGLang for serving activations via input_embeds.
  • The NLA package integrates cleanly with Miles and SGLang via extension points and hooks, enabling seamless upstream updates.
  • Includes a 4-stage data generation pipeline for creating activation datasets.

Maintenance & Community

The project lists multiple authors in its academic citation. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The core library is licensed under Apache-2.0, which permits commercial use. Released checkpoints inherit the licenses of their base models (Gemma, Llama-3.3), which may impose additional restrictions. Users must consult the base model NOTICE files.

Limitations & Caveats

Reproducing checkpoints demands significant computational resources (e.g., 8x H100s for RL). Inference relies on specific SGLang configurations. Users must verify base LLM license terms for commercial deployment compatibility.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 2
Star History: 706 stars in the last 22 days
