persona_vectors  by safety-research

LLM trait control and monitoring framework

Created 7 months ago
371 stars

Top 76.7% on SourcePulse

Project Summary

Persona Vectors provides a method for monitoring and controlling specific character traits within large language models. Aimed at researchers and developers, it lets them amplify or suppress traits such as "evil" or "helpful" by manipulating trait-specific vectors, which makes model behavior more controllable.

How It Works

The core approach involves generating "persona vectors" by calculating the mean difference in model activations between positive and negative prompts associated with a target trait. These vectors, representing the trait's influence, can then be applied during inference-time steering or integrated into the training process for preventative control. This allows for fine-grained behavioral modification of LLMs.
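The mean-difference idea described above can be illustrated with a minimal sketch. It uses toy three-dimensional lists in place of real hidden states, and the function names (mean_vector, persona_vector, steer) and the coeff parameter are hypothetical, chosen for illustration only; they are not taken from the repository, whose implementation may differ.

```python
from statistics import fmean


def mean_vector(activations):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*activations)]


def persona_vector(positive_acts, negative_acts):
    """Difference of means: trait-positive activations minus trait-negative ones."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden_state, vector, coeff):
    """Add coeff * vector to a hidden state (assumed steering rule).

    A positive coeff pushes toward the trait, a negative coeff away from it.
    """
    return [h + coeff * v for h, v in zip(hidden_state, vector)]


# Toy activations standing in for hidden states collected from a real model.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

vec = persona_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], vec, coeff=-1.0)
```

In a real setting the vectors would come from a chosen layer of the model and the steering step would run inside the forward pass; the sketch only shows the arithmetic.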

Quick Start & Requirements

  • Installation: Requires setting up a Python virtual environment, installing dependencies via pip install -r requirements.txt, and configuring API keys in a .env file.
  • Prerequisites: Python, the packages listed in requirements.txt, API keys (e.g., OpenAI or Anthropic, used for evaluation and generation), and a GPU for most operations.
  • Usage: Key scripts include generate_vec.py for vector computation, eval.eval_persona for evaluation and inference-time steering, and training.py for model training with or without steering.
  • Resources: GPU is mandatory for evaluation and training.
  • Links: No external links for docs or demos are provided.

Highlighted Details

  • Enables precise monitoring and control of LLM character traits.
  • Supports both inference-time steering and training-time preventative measures.
  • Uses OpenAI models, accessed via API, for evaluation.
  • Generates distinct vectors for prompt, response, and all token activations.
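As a rough illustration of the last point, the sketch below averages per-token activations over the prompt tokens, the response tokens, and all tokens, then scores one of the results against a trait direction. Treating monitoring as a projection onto the vector is an assumption made here for illustration, and the names pooled_activations, projection, and prompt_len are hypothetical rather than taken from the repository.

```python
import math


def mean_over(tokens):
    """Average a non-empty list of per-token activation vectors."""
    n = len(tokens)
    return [sum(dim) / n for dim in zip(*tokens)]


def pooled_activations(token_acts, prompt_len):
    """Pool activations over the prompt span, the response span, and all tokens."""
    return {
        "prompt": mean_over(token_acts[:prompt_len]),
        "response": mean_over(token_acts[prompt_len:]),
        "all": mean_over(token_acts),
    }


def projection(activation, vector):
    """Scalar projection of an activation onto a trait direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(a * v for a, v in zip(activation, vector)) / norm


# Four toy tokens with 2-dimensional activations: two prompt tokens, two response tokens.
token_acts = [[1.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 8.0]]
pooled = pooled_activations(token_acts, prompt_len=2)

# Assumed monitoring signal: how far the response activations lie along the trait direction.
score = projection(pooled["response"], [0.0, 2.0])
```

Computing a separate difference-of-means vector from each pooled variant would yield one vector per span, which is one plausible reading of the three vector types listed above.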

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The repository's license is not specified in the README, making commercial use or integration decisions difficult without further clarification.

Limitations & Caveats

The project relies heavily on external API services for artifact generation and evaluation, incurring potential costs and external dependencies. GPU hardware is a strict requirement for core functionalities. The absence of a specified license is a significant adoption blocker.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 18 stars in the last 30 days
