Research paper code for unsupervised discovery of latent knowledge in LLMs
This repository provides code for discovering truth-like features in language model activations in a purely unsupervised way, addressing the fact that standard training can leave a model's outputs misaligned with what it internally represents as true. It is aimed at researchers and practitioners who want to understand and verify the internal knowledge of LLMs, offering a way to extract factual information independently of the model's outputs.
How It Works
The core method, CCS (Contrast-Consistent Search), identifies a direction in activation space that satisfies logical consistency properties: for instance, a statement and its negation should receive opposite truth values. Because the approach operates on internal model states rather than on explicit supervision or model outputs, it can uncover knowledge even when a model is prompted to generate incorrect information.
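To make the consistency idea concrete, below is a minimal sketch of a CCS-style linear probe in PyTorch. It is illustrative only: the names (CCSProbe, ccs_loss, phi_pos, phi_neg) and hyperparameters are assumptions rather than the repository's actual API, and the random tensors stand in for hidden states extracted from contrast pairs (a statement and its negation).

```python
# Minimal CCS-style sketch, assuming phi_pos / phi_neg are pre-extracted
# hidden states for a statement and its negation. Names and values here are
# illustrative, not the repository's exact interface.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of being true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: p(statement) and p(negation) should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy training loop on random activations (stand-ins for real hidden states).
hidden_dim = 768
phi_pos = torch.randn(128, hidden_dim)  # activations for "x is true"-style prompts
phi_neg = torch.randn(128, hidden_dim)  # activations for the negated prompts

probe = CCSProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    optimizer.zero_grad()
    loss = ccs_loss(probe(phi_pos), probe(phi_neg))
    loss.backward()
    optimizer.step()
```

The loss combines a consistency term, which pushes the probabilities assigned to a statement and its negation to sum to one, with a confidence term that rules out the degenerate solution of assigning 0.5 to everything.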
Quick Start & Requirements
pip install -r requirements.txt
(A requirements.txt is not included in the README but is implied by usage; key dependencies are datasets and promptsource.)
python generate.py --model_name <model> --num_examples <num> --batch_size <size>
python evaluate.py --model_name <model> --num_examples <num> --batch_size <size>
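As an illustrative run (the model identifier and argument values below are placeholders; the names actually accepted depend on the repository's generate.py):
python generate.py --model_name deberta --num_examples 100 --batch_size 20
python evaluate.py --model_name deberta --num_examples 100 --batch_size 20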
Maintenance & Community
The README recommends using EleutherAI/elk, a newer and improved codebase for the same line of work. No other community or maintenance details are provided.
Licensing & Compatibility
The license is not explicitly stated in the README.
Limitations & Caveats
The project recommends a different, improved codebase (EleutherAI/elk), suggesting this repository may be outdated or less maintained. The requirements.txt file is not directly provided, requiring manual installation of dependencies.