persona_vectors by safety-research

LLM trait control and monitoring framework

Created 10 months ago

437 stars

Top 67.7% on SourcePulse

Project Summary

Persona Vectors provides a method for monitoring and controlling character traits in large language models. Aimed at researchers and developers, it lets users induce or suppress traits such as "evil" or "helpful" by manipulating trait-specific activation vectors.

How It Works

The core approach computes a "persona vector" as the mean difference in model activations between positive and negative prompts associated with a target trait. The vector, which represents the trait's influence on the model's activations, can then be applied for inference-time steering or integrated into the training process as a preventative control. This allows fine-grained modification of LLM behavior.
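The mean-difference idea described above can be sketched with plain NumPy arrays standing in for model activations. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `persona_vector`, `steer`, and `projection`, the array shapes, and the synthetic data are all hypothetical.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    """Mean-difference direction between two activation sets.

    pos_acts, neg_acts: arrays of shape (n_samples, hidden_dim), e.g. one
    pooled activation per trait-positive or trait-negative prompt.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, coeff):
    """Add coeff * vector to every hidden state (broadcasts over tokens).

    A positive coeff pushes toward the trait, a negative coeff away from it.
    """
    return np.asarray(hidden, dtype=float) + coeff * vector

def projection(hidden, vector):
    """Scalar projection of each hidden state onto the unit trait direction.

    Usable as a simple monitoring signal: larger values mean the state
    lies further along the trait direction.
    """
    unit = vector / np.linalg.norm(vector)
    return np.asarray(hidden, dtype=float) @ unit

# Synthetic stand-in data (assumption): the trait shifts dimension 0.
rng = np.random.default_rng(0)
hidden_dim = 8
trait_direction = np.zeros(hidden_dim)
trait_direction[0] = 1.0
pos = rng.normal(size=(32, hidden_dim)) + 2.0 * trait_direction
neg = rng.normal(size=(32, hidden_dim))

vec = persona_vector(pos, neg)
hidden = rng.normal(size=(5, hidden_dim))      # 5 token positions
steered = steer(hidden, vec, coeff=-1.0)       # suppress the trait
shift = projection(steered, vec) - projection(hidden, vec)
```

In this sketch, steering with `coeff=-1.0` lowers every token's projection by exactly the norm of the vector. In a real model the addition would typically happen inside a forward hook at a chosen layer, which is outside the scope of this example.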

Quick Start & Requirements

Installation: Set up a Python virtual environment, install dependencies with pip install -r requirements.txt, and configure API keys in a .env file.
Prerequisites: Python, API keys (e.g., OpenAI or Anthropic, for evaluation and generation), and a GPU for most operations.
Usage: Key entry points are generate_vec.py for vector computation, eval.eval_persona for evaluation and inference-time steering, and training.py for model training with or without steering.
Resources: A GPU is mandatory for evaluation and training.
Links: No external links for docs or demos are provided.
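Since the setup depends on API keys stored in a .env file, a quick pre-flight check can catch a missing key before a long GPU job starts. The sketch below uses only the standard library and is not part of the project: the helper names `read_env_file` and `missing_keys` are hypothetical, and the key names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumptions, since the summary does not list the exact variables the project expects.

```python
import tempfile
from pathlib import Path

def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(values, required):
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not values.get(key)]

# Demo against a throwaway file; the key names are assumed, not confirmed.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# demo file\nOPENAI_API_KEY=sk-placeholder\nANTHROPIC_API_KEY=\n"
    )
    demo_values = read_env_file(env_path)
    demo_missing = missing_keys(
        demo_values, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
    )
```

Here `demo_missing` contains only the key that was left empty. Pointing `read_env_file` at the real .env file in the repository root would perform the same check on an actual setup.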

Highlighted Details

Enables precise monitoring and control of LLM character traits.
Supports both inference-time steering and training-time preventative measures.
Uses OpenAI models for evaluation.
Generates distinct vectors for prompt, response, and all token activations.
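The distinction between prompt, response, and all-token activations can be illustrated by pooling one sequence's per-token activations over different spans. This is a hedged sketch: the function `span_means`, the `prompt_len` parameter, and the use of a simple mean are assumptions for illustration, and the exact pooling the repository uses is not stated in this summary.

```python
import numpy as np

def span_means(token_acts, prompt_len):
    """Average per-token activations over three spans of one sequence.

    token_acts: array of shape (seq_len, hidden_dim) covering the prompt
    followed by the response.
    prompt_len: number of leading tokens that belong to the prompt.
    """
    token_acts = np.asarray(token_acts, dtype=float)
    if not 0 < prompt_len < len(token_acts):
        raise ValueError("need at least one prompt token and one response token")
    return {
        "prompt": token_acts[:prompt_len].mean(axis=0),
        "response": token_acts[prompt_len:].mean(axis=0),
        "all": token_acts.mean(axis=0),
    }

# Toy sequence: 6 tokens, hidden size 2, with the first 2 tokens as the prompt.
acts = np.arange(12, dtype=float).reshape(6, 2)
means = span_means(acts, prompt_len=2)
```

Computing these pooled activations for trait-positive and trait-negative samples, then taking the mean difference within each span, would yield one candidate vector per span.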

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The repository's license is not specified in the README, making commercial use or integration decisions difficult without further clarification.

Limitations & Caveats

The project relies heavily on external API services for artifact generation and evaluation, which adds usage costs and third-party dependencies. A GPU is strictly required for core functionality. The absence of a specified license is a significant adoption blocker.
