wisent-guard by wisent-ai

Python package for latent space monitoring and guardrails

created 4 months ago

320 stars

Top 86.0% on sourcepulse

Project Summary

Wisent-Guard is a Python package for monitoring and controlling AI model activations, targeting developers and researchers seeking to mitigate harmful outputs and hallucinations. It offers a self-hosted, open-source alternative to traditional guardrails by analyzing internal model representations, providing deeper insights and more robust safety measures.

How It Works

Wisent-Guard employs a representation engineering approach, using contrastive pairs of "harmful" vs. "harmless" phrase activations to identify undesirable model behavior. It trains classifiers on these activation patterns, allowing for real-time monitoring during inference. This method aims to detect out-of-distribution harmful content and hallucinations by analyzing the model's internal "thoughts," rather than just the final output.
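The contrastive-pair idea can be illustrated with a minimal sketch. It assumes activations are already extracted as plain lists of floats, and it uses a simple difference-of-means probe as a stand-in classifier. The function names and toy data below are hypothetical, and this is not Wisent-Guard's actual API or classifier.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def fit_direction_classifier(harmful, harmless):
    """Fit a difference-of-means probe from contrastive activations.

    Returns (direction, threshold). An activation is scored by its
    projection onto direction and flagged when the score exceeds the
    threshold, which sits midway between the two projected class means.
    """
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    direction = [h - s for h, s in zip(mu_harmful, mu_harmless)]
    threshold = (dot(mu_harmful, direction) + dot(mu_harmless, direction)) / 2
    return direction, threshold


def is_flagged(activation, direction, threshold):
    """True when the activation projects past the threshold."""
    return dot(activation, direction) > threshold


# Toy 3-dimensional "activations" (hypothetical data); real ones would be
# hidden states read from a chosen layer of the model.
harmful_acts = [[0.9, 0.1, 0.4], [1.1, 0.0, 0.5], [1.0, 0.2, 0.3]]
harmless_acts = [[0.1, 0.1, 0.4], [-0.1, 0.0, 0.5], [0.0, 0.2, 0.3]]

direction, threshold = fit_direction_classifier(harmful_acts, harmless_acts)
```

In this toy data the two groups differ only along the first dimension, so the probe learns to separate them on that axis alone. A trained classifier over real activations would replace the midpoint rule.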

Quick Start & Requirements

Install: pip install wisent-guard
Prerequisites: Python, Hugging Face Transformers models. Apple Silicon (MPS) support is available.
Setup: Requires loading a Hugging Face model and tokenizer. Training a classifier involves providing phrase pairs.
Docs: The examples folder provides detailed usage.

Highlighted Details

Reports a 43% reduction in hallucination rate on TruthfulQA with Llama 3.1 8B.
Model-agnostic, supporting most transformer-based language models.
Features include customizable thresholds, layer selection, real-time monitoring, and response logging.
Offers early termination with customizable placeholder messages.
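Threshold-based monitoring with early termination and logging might look roughly like the sketch below. It assumes generation yields (token, activation) pairs and that a scoring function maps each activation to a number. `monitored_generate`, its parameters, and the choice to replace the whole response with the placeholder are illustrative assumptions, not the package's interface.

```python
def monitored_generate(token_stream, score_fn, threshold=0.5,
                       placeholder="[response blocked]"):
    """Emit tokens until one scores above threshold, then stop.

    token_stream yields (token, activation) pairs (hypothetical shape).
    score_fn maps an activation to a float.
    Returns (text, log): text is the joined tokens, or the placeholder
    if generation was cut short; log records every scored token.
    """
    emitted, log = [], []
    for token, activation in token_stream:
        score = score_fn(activation)
        log.append({"token": token, "score": score})
        if score > threshold:
            # Early termination: discard the partial text.
            return placeholder, log
        emitted.append(token)
    return "".join(emitted), log


def first_component(activation):
    """Toy scorer (assumption): use the first activation value as the score."""
    return activation[0]


risky = [("The", [0.1]), (" sky", [0.2]), (" is", [0.9]), (" green", [0.1])]
safe = [("Hi", [0.1]), ("!", [0.2])]

blocked_text, blocked_log = monitored_generate(risky, first_component)
safe_text, safe_log = monitored_generate(safe, first_component)
```

Here the third token of the risky stream crosses the threshold, so the caller receives the placeholder along with a three-entry log, while the safe stream passes through unchanged.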

Maintenance & Community

Developed by Lukasz Bartoszcze.
Contributions are welcome via Pull Requests.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Wisent-Guard is described as experimental technology that requires careful hyperparameter tuning (model tokens, activation layers) for each use case. Latency and compute overhead can be concerns, though support is offered for optimization.
