natural_language_autoencoders by kitft

Explain LLM activations with natural language autoencoders

Created 2 months ago

893 stars

Top 39.7% on SourcePulse

1 Expert Loves This Project

Didier Lopes

Founder of OpenBB

Project Summary

This repository provides Natural Language Autoencoders (NLA), an open-source library for generating unsupervised explanations of LLM activations. By mapping activation vectors to natural language and back, NLA offers researchers and engineers a tool for understanding internal LLM mechanisms and the semantic content captured by model activations.

How It Works

An NLA consists of two parts: an Activation Verbalizer (AV), which maps vectors to text, and an Activation Reconstructor (AR), which maps text back to vectors. The AV injects the activation vector into a prompt as a token embedding and autoregressively generates a description. The AR uses a truncated LM to recover the vector from that text. Vectors are L2-normalized, so the round-trip Mean Squared Error (MSE) between the original and reconstructed vector depends only on the angle between them; explanation quality is therefore scored by directional agreement.
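The link between round-trip MSE and directional agreement can be checked numerically. The sketch below is illustrative only and is not taken from the repository: the helper names (l2_normalize, round_trip_mse, cosine) are hypothetical, and it assumes the MSE is averaged over the d vector dimensions. Under that assumption, for unit vectors the MSE equals (2 - 2 cos θ) / d, where θ is the angle between the original and the reconstruction.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    return v / max(float(np.linalg.norm(v)), eps)


def round_trip_mse(original, reconstructed):
    """MSE between the L2-normalized original and its reconstruction.

    Assumption: the error is averaged over vector dimensions.
    """
    a = l2_normalize(original)
    b = l2_normalize(reconstructed)
    return float(np.mean((a - b) ** 2))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(l2_normalize(u), l2_normalize(v)))


rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)                 # stand-in for an LLM activation
x_hat = x + 0.3 * rng.normal(size=d)   # stand-in for the AR's reconstruction

mse = round_trip_mse(x, x_hat)
cos = cosine(x, x_hat)

# For unit vectors, ||a - b||^2 = 2 - 2 * cos, so the mean over d dims is:
identity_gap = abs(mse - (2.0 - 2.0 * cos) / d)
print(f"mse={mse:.6f}  cos={cos:.4f}  identity gap={identity_gap:.2e}")
```

One consequence of the normalization is that rescaling the reconstruction leaves the score unchanged: only its direction matters.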

Quick Start & Requirements

For inference, install the following packages: torch, transformers, safetensors, httpx, orjson, pyyaml, numpy, and "sglang[all]>=0.5.6". Launch an SGLang server in the background (python -m sglang.launch_server --model-path <model_path> --port 30000 --disable-radix-cache &), then run inference against it (python nla_inference.py <model_path> --sglang-url http://localhost:30000 --parquet path/to/activations.parquet). Training requires substantial GPU resources (e.g., multiple H100s) and proceeds through data generation, SFT, and RL stages, which are detailed in configs/TRAINING_NOTES.md. Inference is documented in docs/inference.md.
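The two commands above can also be assembled from a script, which keeps the port in the server command and the inference URL in sync. This is a minimal sketch: the function names are hypothetical, the flags are copied verbatim from the commands quoted above, and <model_path> is left as a placeholder to fill in.

```python
import shlex


def sglang_server_cmd(model_path, port=30000):
    """argv for the SGLang server, using only the flags quoted above."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--disable-radix-cache",
    ]


def nla_inference_cmd(model_path, parquet_path, port=30000):
    """argv for nla_inference.py, pointed at the local SGLang server."""
    return [
        "python", "nla_inference.py", model_path,
        "--sglang-url", f"http://localhost:{port}",
        "--parquet", parquet_path,
    ]


MODEL_PATH = "<model_path>"  # placeholder: substitute a real checkpoint path

server = sglang_server_cmd(MODEL_PATH)
infer = nla_inference_cmd(MODEL_PATH, "path/to/activations.parquet")

# Print shell-quoted command lines for inspection; nothing is executed here.
print(shlex.join(server))
print(shlex.join(infer))
```

To actually run them, the server list could be passed to subprocess.Popen (the equivalent of the trailing &) and the inference list to subprocess.run once the server is accepting requests.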

Highlighted Details

Released checkpoints cover Qwen2.5-7B, Gemma-3-12B/27B, and Llama-3.3-70B, with activations extracted from mid-to-deep layers of each model.
Training builds on Miles for Ray-orchestrated RL (FSDP2/Megatron, GRPO) and on SGLang, which receives activations via input_embeds.
The NLA package integrates with Miles and SGLang through extension points and hooks, which makes it easier to track upstream updates.
Includes a 4-stage data generation pipeline for creating activation datasets.
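The idea of passing an activation to the verbalizer through input_embeds can be pictured with a toy example. The sketch below is not the repository's code and does not call SGLang: the embedding table, the token ids, the placeholder id SLOT_ID, and the function inject_activation are all hypothetical. It only shows the general pattern of looking up prompt embeddings and overwriting one placeholder position with an L2-normalized activation vector.

```python
import numpy as np


def inject_activation(token_embeds, activation, slot):
    """Return a copy of token_embeds with row `slot` replaced by the
    L2-normalized activation. The input array is left unmodified."""
    if activation.shape != token_embeds.shape[1:]:
        raise ValueError("activation must match the embedding width")
    out = token_embeds.copy()
    out[slot] = activation / np.linalg.norm(activation)
    return out


rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
embedding_table = rng.normal(size=(vocab_size, d_model))  # toy embeddings

SLOT_ID = 42                              # hypothetical placeholder token id
prompt_ids = np.array([5, 17, SLOT_ID, 8])
slot = int(np.flatnonzero(prompt_ids == SLOT_ID)[0])

prompt_embeds = embedding_table[prompt_ids]   # shape (seq_len, d_model)
activation = rng.normal(size=d_model)         # stand-in for an LLM activation

# This array plays the role of what would be sent as input_embeds.
input_embeds = inject_activation(prompt_embeds, activation, slot)
print(input_embeds.shape, float(np.linalg.norm(input_embeds[slot])))
```

Whether the released models rescale the injected vector or use a dedicated placeholder token is not stated here, so both choices in this sketch are assumptions.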

Maintenance & Community

The project lists multiple authors in its academic citation. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The core library is Apache-2.0 licensed, permissive for commercial use. Released checkpoints inherit base model licenses (Gemma, Llama-3.3), which may impose additional restrictions. Users must consult base model NOTICE files.

Limitations & Caveats

Reproducing checkpoints demands significant computational resources (e.g., 8x H100s for RL). Inference relies on specific SGLang configurations. Users must verify base LLM license terms for commercial deployment compatibility.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

107 stars in the last 30 days