Research paper code for unsupervised discovery of latent knowledge in LLMs
This repository provides code for discovering truth-like features in language model activations in a purely unsupervised way, addressing the fact that standard training can leave a model's outputs misaligned with what it internally represents as true. It is aimed at researchers and practitioners who want to understand and verify the internal knowledge of LLMs, offering a way to extract factual information independently of the model's outputs.
How It Works
The core method, CCS (Contrast-Consistent Search), identifies a direction in activation space that satisfies logical consistency properties: for instance, a statement and its negation should receive opposite truth values. Because the approach operates on internal model states rather than on explicit supervision or model outputs, it can uncover knowledge even when a model is prompted to generate incorrect information.
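To make the consistency idea concrete, below is a minimal sketch of a CCS-style linear probe in PyTorch. It is illustrative only: the names (CCSProbe, ccs_loss, phi_pos, phi_neg) and hyperparameters are assumptions rather than the repository's actual API, and the random tensors stand in for hidden states extracted from contrast pairs (a statement and its negation).

```python
# Minimal CCS-style sketch, assuming phi_pos / phi_neg are pre-extracted
# hidden states for a statement and its negation. Names and values here are
# illustrative, not the repository's exact interface.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of being true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: p(statement) and p(negation) should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy training loop on random activations (stand-ins for real hidden states).
hidden_dim = 768
phi_pos = torch.randn(128, hidden_dim)  # activations for "x is true"-style prompts
phi_neg = torch.randn(128, hidden_dim)  # activations for the negated prompts

probe = CCSProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    optimizer.zero_grad()
    loss = ccs_loss(probe(phi_pos), probe(phi_neg))
    loss.backward()
    optimizer.step()
```

The loss combines a consistency term, which pushes the probabilities assigned to a statement and its negation to sum to one, with a confidence term that rules out the degenerate solution of assigning 0.5 to everything.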
Quick Start & Requirements
pip install -r requirements.txt
(A requirements.txt is not included in the README but is implied by usage; key dependencies are datasets and promptsource.)
python generate.py --model_name <model> --num_examples <num> --batch_size <size>
python evaluate.py --model_name <model> --num_examples <num> --batch_size <size>
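As an illustrative run (the model identifier and argument values below are placeholders; the names actually accepted depend on the repository's generate.py):
python generate.py --model_name deberta --num_examples 100 --batch_size 20
python evaluate.py --model_name deberta --num_examples 100 --batch_size 20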
Maintenance & Community
The README recommends using EleutherAI/elk, a newer and improved codebase for the same line of work. No other community or maintenance details are provided.
Licensing & Compatibility
The license is not explicitly stated in the README.
Limitations & Caveats
The project recommends a different, improved codebase (EleutherAI/elk), suggesting this repository may be outdated or less maintained. The requirements.txt file is not directly provided, requiring manual installation of dependencies.