discovering_latent_knowledge  by collin-burns

Research paper code for unsupervised discovery of latent knowledge in LLMs

created 2 years ago
274 stars

Top 95.2% on sourcepulse

Project Summary

This repository provides code for finding truth-like features in language model activations in a purely unsupervised way. It is motivated by the concern that standard training objectives can lead a model to produce outputs that diverge from what it internally represents as true. The code is aimed at researchers and practitioners who want to understand and verify the internal knowledge of LLMs, and it offers a way to extract answers to yes/no questions without relying on model outputs.

How It Works

The core method, CCS (Contrast-Consistent Search), searches for a direction in activation space that satisfies logical consistency properties. For instance, a statement and its negation should receive opposite truth values. Because the method reads internal model states directly, it needs neither labels nor model outputs, and it can recover knowledge even when the model is prompted to generate incorrect information.
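The consistency idea above can be made concrete with a small sketch. This is an illustration in plain Python, not the repository's code: the function names (`sigmoid`, `probe`, `ccs_loss`) are hypothetical, and the objective is assumed to combine a consistency term (the probabilities for a statement and its negation should sum to one) with a confidence term (which discourages the degenerate answer of 0.5 for both).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probe(theta, bias, x):
    """Hypothetical linear probe: maps one activation vector to a probability."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)) + bias)

def ccs_loss(p_pos, p_neg):
    """Mean of (consistency + confidence) over contrast pairs.

    p_pos[i] is the probe output for the "statement is true" version of pair i,
    p_neg[i] is the probe output for the "statement is false" version.
    """
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        consistency = (pp - (1.0 - pn)) ** 2  # the two outputs should sum to 1
        confidence = min(pp, pn) ** 2         # penalizes both sitting near 0.5
        total += consistency + confidence
    return total / len(p_pos)
```

A perfectly consistent and confident probe (for example 1.0 for one side and 0.0 for the other) gives a loss of zero, while answering 0.5 on both sides is consistent but still penalized by the confidence term. In practice the probe parameters would be fitted by gradient descent on this loss; that step is omitted here.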

Quick Start & Requirements

  • Install: pip install -r requirements.txt (the README does not mention a requirements.txt; if the file is absent, install the prerequisites below manually).
  • Prerequisites: Python 3.7.5, PyTorch 1.12, datasets, promptsource.
  • Usage:
    • Generate activations: python generate.py --model_name <model> --num_examples <num> --batch_size <size>
    • Evaluate activations: python evaluate.py --model_name <model> --num_examples <num> --batch_size <size>
  • Links: CCS.ipynb notebook for a simplified walkthrough.
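For intuition about what evaluating an unsupervised probe involves, here is a minimal sketch. It does not reproduce evaluate.py; the function names are hypothetical, and it assumes the probe outputs for both halves of each contrast pair are already available as Python lists. Because no labels are used in training, the learned direction may point toward "true" or toward "false", so the sketch reports the better of the accuracy and its complement.

```python
def ccs_predict(p_pos, p_neg):
    """Average the two views of each contrast pair and threshold at 0.5."""
    preds = []
    for pp, pn in zip(p_pos, p_neg):
        p_avg = 0.5 * (pp + (1.0 - pn))  # combine both halves of the pair
        preds.append(1 if p_avg > 0.5 else 0)
    return preds

def ccs_accuracy(p_pos, p_neg, labels):
    """Accuracy up to a global sign flip of the unsupervised direction."""
    preds = ccs_predict(p_pos, p_neg)
    acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
    return max(acc, 1.0 - acc)  # labels are used only here, for scoring
```

The labels enter only at scoring time, which is what keeps the probe itself unsupervised.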

Highlighted Details

  • Outperforms zero-shot accuracy by 4% on average across 6 models and 10 datasets.
  • Reduces prompt sensitivity by half.
  • Maintains high accuracy even when models are prompted to generate incorrect answers.
  • Allows extraction of knowledge from model activations without supervision or model outputs.

Maintenance & Community

The README recommends using EleutherAI/elk, an improved codebase, instead of this repository. No other community or maintenance details are provided.

Licensing & Compatibility

The license is not explicitly stated in the README.

Limitations & Caveats

Because the README points to EleutherAI/elk as the improved codebase, this repository is likely outdated or minimally maintained. The README does not reference a requirements.txt file, so dependencies may need to be installed manually.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days
