safety-research
LLM trait control and monitoring framework
Top 76.7% on SourcePulse
Persona Vectors provides a method for monitoring and controlling specific character traits in large language models. Aimed at researchers and developers, it offers a mechanism to induce or suppress traits such as "evil" or "helpful" through targeted vector manipulation, improving LLM controllability.
How It Works
The core approach involves generating "persona vectors" by calculating the mean difference in model activations between positive and negative prompts associated with a target trait. These vectors, representing the trait's influence, can then be applied during inference-time steering or integrated into the training process for preventative control. This allows for fine-grained behavioral modification of LLMs.
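The mean-difference computation and the steering step described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names persona_vector and steer are hypothetical, the short lists stand in for real model activations, and adding a scaled vector to a hidden state is an assumption about how the vector is "applied" during steering.

```python
from statistics import fmean


def persona_vector(pos_acts, neg_acts):
    """Mean activation over trait-positive prompts minus the mean over
    trait-negative prompts, computed per hidden dimension.
    (Hypothetical helper, not part of the repository.)"""
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, vector, coeff):
    """Shift a hidden state along the persona vector (assumed additive
    steering). A positive coeff pushes toward the trait, a negative
    coeff pushes away from it."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


# Toy activations: two prompts per side, three hidden dimensions.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

vec = persona_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], vec, coeff=-1.0)  # suppress the trait
```

In a real setting the activations would come from a chosen layer of the model, and the coefficient would be tuned empirically; both choices are outside what this summary specifies.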
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt and configure API keys in a .env file.
The dependencies in requirements.txt, API keys (e.g., OpenAI or Anthropic, for evaluation and generation), and a GPU are necessary for most operations.
Key entry points are generate_vec.py for vector computation, eval.eval_persona for evaluation and inference-time steering, and training.py for model training with or without steering.
Highlighted Details
Maintenance & Community
The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.
Licensing & Compatibility
The repository's license is not specified in the README, making commercial use or integration decisions difficult without further clarification.
Limitations & Caveats
The project relies heavily on external API services for artifact generation and evaluation, which adds potential cost and external dependencies. A GPU is required for most core operations. The absence of a specified license is a significant adoption blocker.
7 months ago
Inactive